JP7368995B2

JP7368995B2 - Image recognition system, imaging device, recognition device, and image recognition method

Info

Publication number: JP7368995B2
Application number: JP2019179524A
Authority: JP
Inventors: 龍佑野坂; 高晴黒川; 秀紀氏家
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2023-10-25
Anticipated expiration: 2039-09-30
Also published as: JP2021056785A

Description

The present invention relates to an image recognition system, an imaging device, a recognition device, and an image recognition method.

In recent years, the number of installed surveillance cameras has increased with heightened awareness of crime prevention. As a result, it has become difficult for a human monitor to view the images captured by imaging devices such as surveillance cameras and recognize targets such as suspicious persons or suspicious objects. There is therefore a growing demand for performing image recognition processing on such images so that targets are recognized automatically.

To perform image recognition processing, captured images must be temporarily stored and/or transmitted, and storing and/or transmitting images captured by a large number of imaging devices requires a large storage capacity and/or transmission capacity. It is therefore preferable to keep down the data volume of the images subjected to image recognition processing while maintaining the accuracy of image recognition. Patent Document 1 discloses a surveillance camera that estimates the edge level of each of a plurality of blocks into which a captured image is divided, based on the strength of the edges contained in the block, and replaces each block with a low-resolution image based on the estimated edge level. Patent Document 2 discloses a surveillance camera that detects fine edges whose thickness is at or below a reference value together with the fine-structure regions in their vicinity, and replaces the area outside the detected fine-structure regions with a low-resolution image.

Japanese Patent Application Publication No. 2015-088817
Japanese Patent Application Publication No. 2015-088818

However, with the methods of Patent Documents 1 and 2, the data volume of the image after replacement varies with the edge strength of the image and similar factors, which makes it difficult to predict the storage capacity and/or transmission capacity required for image recognition processing. It is therefore desirable to stably reduce the data volume of images that are the target of image recognition while maintaining the accuracy of image recognition.

The present invention has been made to solve the above problem, and its object is to provide an image recognition system, an imaging device, a recognition device, and an image recognition method that make it possible to stably reduce the data volume of images that are the target of image recognition while maintaining the accuracy of image recognition.

The image recognition system according to the present invention comprises: imaging means that images a space in which a predetermined target can be imaged and generates a first image consisting of pixels having gradation values within a first gradation range; conversion means that converts the first image into a second image by a converter that, when the first image is input, outputs a second image consisting of pixels having gradation values within a second gradation range smaller than the first gradation range; and recognition means that generates a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result.

In the image recognition system according to the present invention, the converter and the recognizer are preferably trained neural networks obtained by training a neural network, in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a learning first image consisting of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for the learning first image.

In the image recognition system according to the present invention, the converter and the recognizer are preferably trained neural networks trained so that, when the learning first image is input to the coupled neural network, the image having the second gradation range output by the converter approaches an edge image generated from the learning first image, and the recognition result output by the coupled neural network approaches the learning recognition result.

Preferably, the image recognition system according to the present invention further comprises output means that outputs the converted second image to a predetermined transmission network and acquisition means that acquires the second image from the transmission network, and the recognition means generates a recognition result for the acquired second image.

The imaging device according to the present invention comprises: imaging means that images a space in which a predetermined target can be imaged and generates a first image consisting of pixels having gradation values within a first gradation range; conversion means that converts the first image into a second image by a converter that, when the first image is input, outputs a second image consisting of pixels having gradation values within a second gradation range smaller than the first gradation range; and output means that outputs the second image.

In the imaging device according to the present invention, the converter is preferably a trained neural network obtained by training a neural network in which the converter is coupled to a recognizer so that the output of the converter becomes the input of the recognizer, the recognizer being one that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result. The training is performed such that the recognition result output when a learning first image consisting of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for the learning first image.

The recognition device according to the present invention comprises: acquisition means that acquires a second image obtained by converting a first image generated by imaging, the conversion being performed by a converter that, when a first image consisting of pixels having gradation values within a first gradation range is input, outputs a second image consisting of pixels having gradation values within a second gradation range smaller than the first gradation range; and recognition means that generates a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing a target on the second image and outputs a recognition result.

In the recognition device according to the present invention, the recognizer is preferably a trained neural network obtained by training a neural network, in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer, such that the recognition result output when a learning first image consisting of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for the learning first image.

The image recognition method according to the present invention includes: imaging a space in which a predetermined target can be imaged and generating a first image consisting of pixels having gradation values within a first gradation range; converting the first image into a second image by a converter that, when the first image is input, outputs a second image consisting of pixels having gradation values within a second gradation range smaller than the first gradation range; and generating a recognition result for the second image by a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result.

The image recognition system, imaging device, recognition device, and image recognition method according to the present invention make it possible to stably reduce the data volume of images that are the target of image recognition while maintaining the accuracy of image recognition.

FIG. 1 is a schematic diagram for explaining the outline of the present invention.
FIG. 2 is a diagram showing an example of a schematic configuration of the image recognition system 1.
FIG. 3 is a diagram showing an example of a schematic configuration of the learning device 2.
FIG. 4 is a diagram showing an example of a schematic configuration of the imaging device 3.
FIG. 5 is a diagram showing an example of a schematic configuration of the recognition device 4.
FIG. 6 is a schematic diagram for explaining the outline of the converter.
FIG. 7 is a schematic diagram for explaining the outline of the discriminator.
FIG. 8 is a diagram showing an example of the data structure of the learning data 211.
FIG. 9 is a flow diagram showing an example of the flow of the learning processing.
FIG. 10 is a sequence diagram showing an example of the flow of the image recognition processing.
FIG. 11 is a diagram showing an example of the recognition result screen 700.

Various embodiments of the present invention are described below with reference to the drawings. Note, however, that the technical scope of the present invention is not limited to these embodiments and extends to the invention described in the claims and its equivalents.

(Summary of the present invention)
FIG. 1 is a schematic diagram for explaining the outline of the present invention. The image recognition system according to the present invention includes imaging means, conversion means, and recognition means.

The imaging means images a space in which a predetermined target can be imaged and generates a first image consisting of pixels having gradation values within a first gradation range. The predetermined target is, for example, a person, and the first image is, for example, an image consisting of pixels having gradation values of 0 to 255 for each of the three RGB channels. The conversion means uses a converter to convert the first image generated by the imaging means into a second image consisting of pixels having gradation values within a second gradation range smaller than the first gradation range. The second image is, for example, an image consisting of pixels having a gradation value of 0 or 1. The recognition means uses a recognizer to generate a recognition result for the predetermined target with respect to the second image produced by the conversion means. The recognition result is, for example, information indicating a rectangular region circumscribing the image of a person appearing in the second image.

The converter and the recognizer are trained neural networks generated by training a neural network in which they are coupled so that the output of the converter becomes the input of the recognizer. The training is performed so that the recognition result output from the neural network when a learning first image consisting of pixels having gradation values within the first gradation range is input approaches a learning recognition result set in advance as the recognition result that should be output for the learning first image.
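One way to write the training objective described above is sketched below. The notation is introduced here for illustration only and does not appear in the specification: C denotes the converter with parameters θ_C, R the recognizer with parameters θ_R, x_i a learning first image, y_i its preset learning recognition result, and ℓ an unspecified loss that measures how far the output recognition result is from y_i.

```latex
\[
(\theta_C^{*},\, \theta_R^{*})
  = \arg\min_{\theta_C,\, \theta_R}
    \sum_{i=1}^{N} \ell\bigl( R_{\theta_R}\bigl( C_{\theta_C}(x_i) \bigr),\; y_i \bigr)
\]
```

Because the loss is evaluated only at the output of the coupled network, both sets of parameters are adjusted toward the same goal, which is the sense in which the converter and the recognizer are trained together.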

In this way, in the image recognition system, the conversion means converts the first image into the second image using a converter that outputs the second image when the first image is input. This allows the image recognition system to stably reduce the data volume of images that are the target of image recognition while maintaining the accuracy of image recognition. Specifically, the data volume of the second image obtained by converting the first image is determined by the second gradation range and the number of pixels, and does not depend on the content of the first image. By converting the first image into the second image, the image recognition system can therefore stably reduce the data volume of the image regardless of the content of the first image.
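The point that the data volume depends only on the gradation range and the pixel count can be checked with simple arithmetic. The following is a minimal sketch under stated assumptions: images are stored uncompressed, the second image has a single channel, and the 640 x 480 resolution is a hypothetical value chosen for illustration. The function names are not taken from the specification.

```python
def bits_per_value(levels: int) -> int:
    """Smallest whole number of bits that can encode `levels` distinct gradation values."""
    return max(1, (levels - 1).bit_length())


def raw_image_bytes(width: int, height: int, channels: int, levels: int) -> int:
    """Uncompressed size in bytes; depends only on the shape and the gradation range."""
    total_bits = width * height * channels * bits_per_value(levels)
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 640 x 480 frame (the resolution is an assumption, not from the source).
first_image = raw_image_bytes(640, 480, channels=3, levels=256)   # RGB, values 0 to 255
second_image = raw_image_bytes(640, 480, channels=1, levels=2)    # values 0 or 1

print(first_image, second_image, first_image // second_image)
```

Under these assumptions the first image occupies 921,600 bytes and the second image 38,400 bytes, a factor of 24, and neither figure changes with what the camera happens to see.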

In addition, the converter that outputs the second image when the first image is input is generated by training, with the learning first images and the learning recognition results, a neural network in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer. This allows the image recognition system to have the converter output a second image on which the recognizer can perform image recognition with high accuracy.

Note that the above description of FIG. 1 is given only to deepen understanding of the content of the present invention. Specifically, the present invention is implemented in each of the embodiments described below, and may also be implemented in various modifications that do not substantially depart from the principles of the present invention. All such modifications fall within the scope of the present invention and the disclosure of this specification.

(Schematic system configuration)
FIG. 2 is a diagram showing an example of a schematic configuration of the image recognition system 1. The image recognition system 1 includes a learning device 2, an imaging device 3, a recognition device 4, and a display device 5. The learning device 2, the imaging device 3, the recognition device 4, and the display device 5 are connected to one another via a transmission network 6 such as the Internet or an intranet.

The learning device 2 is an information processing device such as a server or a PC (Personal Computer). The learning device 2 generates a converter and a recognizer, which are trained neural networks, through simultaneous learning. The converter is a neural network that outputs a binary image when a multivalued image is input. The recognizer is a neural network that, when a binary image is input, outputs the region of a person in the binary image. Simultaneous learning of the converter and the recognizer means training a neural network in which the converter and the recognizer are coupled. The multivalued image is an example of the first image, and the binary image is an example of the second image.

The imaging device 3 is, for example, a surveillance camera. The imaging device 3 generates a multivalued image of, for example, a room in a building by imaging that room. The imaging device 3 converts the multivalued image into a binary image using the converter generated by the learning device 2, and outputs the converted binary image to the transmission network 6. The room is an example of a space in which the target can be imaged.

The recognition device 4 is an information processing device such as a server or a PC. The recognition device 4 acquires the binary image output by the imaging device 3 from the transmission network 6. The recognition device 4 generates a recognition result for the binary image using the recognizer generated by the learning device 2, and outputs the generated recognition result to the transmission network 6. The data volume of the acquired binary image is smaller than that of the original multivalued image and is a fixed value. The transmission capacity required of the recognition device 4 and the transmission network 6 can therefore be reduced, as can the storage capacity the recognition device 4 needs for temporary storage or archiving. Furthermore, because the binary image is generated by a converter trained simultaneously with the recognizer, transmission capacity and storage capacity can be reduced while the recognition accuracy after conversion is kept high.
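To illustrate why the binary image has a fixed data volume, the sketch below packs a 0/1 image into bytes at one bit per pixel. The specification does not state how the binary image is serialized, so the row-major, most-significant-bit-first layout and the function names `pack_binary_image` and `unpack_binary_image` are assumptions made for this example.

```python
def pack_binary_image(pixels):
    """Pack a 2-D list of 0/1 pixels into bytes, row-major, most significant bit first."""
    flat = [p for row in pixels for p in row]
    if any(p not in (0, 1) for p in flat):
        raise ValueError("binary image pixels must be 0 or 1")
    packed = bytearray((len(flat) + 7) // 8)  # length depends only on the pixel count
    for i, p in enumerate(flat):
        if p:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def unpack_binary_image(data, width, height):
    """Inverse of pack_binary_image for an image of the given width and height."""
    pixels = []
    for y in range(height):
        row = []
        for x in range(width):
            i = y * width + x
            row.append((data[i // 8] >> (7 - i % 8)) & 1)
        pixels.append(row)
    return pixels


checker = [[1, 0, 1, 0], [0, 1, 0, 1]]
blank = [[0, 0, 0, 0], [0, 0, 0, 0]]

# Two images with different content but the same shape produce payloads of equal length.
print(len(pack_binary_image(checker)), len(pack_binary_image(blank)))
```

With this layout the payload length is fixed by the width and height alone, so the receiving side can size its buffers in advance.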

The display device 5 is an information processing device such as a server or a PC. The display device 5 acquires the recognition result output by the recognition device 4 from the transmission network 6, and displays the acquired recognition result on a display unit, such as a liquid crystal display, provided in the display device 5.

FIG. 3 is a diagram showing an example of a schematic configuration of the learning device 2. The learning device 2 includes a first storage unit 21, a first communication unit 22, and a first processing unit 23.

The first storage unit 21 is a device for storing programs or data and includes, for example, a semiconductor memory device. The first storage unit 21 stores the operating system programs, driver programs, application programs, data, and the like used in processing by the first processing unit 23. The programs are installed in the first storage unit 21 from a computer-readable, non-transitory portable storage medium such as a CD (Compact Disc)-ROM (Read Only Memory) or a DVD (Digital Versatile Disc)-ROM, using a known setup program or the like.

The first storage unit 21 also stores learning data 211 and a learning model 212.

The first communication unit 22 includes a communication interface circuit that enables the learning device 2 to communicate with other devices. The communication interface circuit of the first communication unit 22 is, for example, a wired LAN (Local Area Network) or wireless LAN interface circuit. The first communication unit 22 receives data transmitted from other devices and supplies it to the first processing unit 23, and transmits data supplied from the first processing unit 23 to other devices.

The first processing unit 23 includes one or more processors and their peripheral circuits. The first processing unit 23 is, for example, a CPU (Central Processing Unit) and controls the overall operation of the learning device 2. The first processing unit 23 may instead be a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an LSI (Large-Scale Integration) circuit, an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or the like. Based on the programs stored in the first storage unit 21, the first processing unit 23 controls the operation of the first communication unit 22 and executes various processes so that the processing of the learning device 2 is carried out in an appropriate sequence. The first processing unit 23 can also execute multiple programs in parallel.

The first processing unit 23 includes learning model acquisition means 231, learning data acquisition means 232, edge image generation means 233, learning means 234, and output means 235. Each of these means is a functional module implemented by a program executed by the first processing unit 23. Each of these means may instead be implemented in the learning device 2 as firmware.

FIG. 4 is a diagram showing an example of a schematic configuration of the imaging device 3. The imaging device 3 includes a second storage unit 31, a second communication unit 32, an imaging unit 33, and a second processing unit 34.

The second storage unit 31 is a device for storing programs or data and includes, for example, a semiconductor memory device. The second storage unit 31 stores the operating system programs, driver programs, application programs, data, and the like used in processing by the second processing unit 34. The programs are installed in the second storage unit 31 from a computer-readable, non-transitory portable storage medium such as a CD-ROM or DVD-ROM, using a known setup program or the like.

The second communication unit 32 includes a communication interface circuit that enables the imaging device 3 to communicate with other devices. The communication interface circuit of the second communication unit 32 is, for example, a wired LAN or wireless LAN interface circuit. The second communication unit 32 receives data transmitted from other devices and supplies it to the second processing unit 34, and transmits data supplied from the second processing unit 34 to other devices.

The imaging unit 33 includes an imaging optical system, an image sensor, an image processing unit, and the like. The imaging optical system is, for example, an optical lens, and focuses the light flux from the subject onto the imaging surface of the image sensor. The image sensor is, for example, a CCD (Charge Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor, and outputs an image signal of the subject image formed on the imaging surface. The image processing unit generates image data in a predetermined format from the image signal generated by the image sensor and supplies it to the second processing unit 34.

The second processing unit 34 includes one or more processors and their peripheral circuits. The second processing unit 34 is, for example, a CPU and controls the overall operation of the imaging device 3. The second processing unit 34 may instead be a GPU, DSP, LSI, ASIC, FPGA, or the like. Based on the programs stored in the second storage unit 31, the second processing unit 34 controls the operation of the second communication unit 32 and the imaging unit 33 and executes various processes so that the processing of the imaging device 3 is carried out in an appropriate sequence. The second processing unit 34 can also execute multiple programs in parallel.

The second processing unit 34 includes imaging means 341, conversion means 342, and binary image output means 343. Each of these means is a functional module implemented by a program executed by the second processing unit 34. Each of these means may instead be implemented in the imaging device 3 as firmware.

FIG. 5 is a diagram showing an example of a schematic configuration of the recognition device 4. The recognition device 4 includes a third storage unit 41, a third communication unit 42, and a third processing unit 43.

The third storage unit 41 is a device for storing programs or data and includes, for example, a semiconductor memory device. The third storage unit 41 stores the operating system programs, driver programs, application programs, data, and the like used in processing by the third processing unit 43. The programs are installed in the third storage unit 41 from a computer-readable, non-transitory portable storage medium such as a CD-ROM or DVD-ROM, using a known setup program or the like.

The third communication unit 42 includes a communication interface circuit that enables the recognition device 4 to communicate with other devices. The communication interface circuit of the third communication unit 42 is, for example, a wired LAN or wireless LAN interface circuit. The third communication unit 42 receives data transmitted from other devices and supplies it to the third processing unit 43, and transmits data supplied from the third processing unit 43 to other devices.

The third processing unit 43 includes one or more processors and their peripheral circuits. The third processing unit 43 is, for example, a CPU and controls the overall operation of the recognition device 4. The third processing unit 43 may instead be a GPU, DSP, LSI, ASIC, FPGA, or the like. Based on the programs stored in the third storage unit 41, the third processing unit 43 controls the operation of the third communication unit 42 and executes various processes so that the processing of the recognition device 4 is carried out in an appropriate sequence. The third processing unit 43 can also execute multiple programs in parallel.

第３処理部４３は、二値画像取得手段４３１、認識手段４３２及び認識結果出力手段４３３を備える。これらの各手段は、第３処理部４３によって実行されるプログラムによって実現される機能モジュールである。これらの各手段は、ファームウェアとして認識装置４に実装されてもよい。 The third processing unit 43 includes a binary image acquisition means 431, a recognition means 432, and a recognition result output means 433. Each of these means is a functional module realized by a program executed by the third processing unit 43. Each of these means may instead be implemented in the recognition device 4 as firmware.

（変換器及び識別器の概要） (Overview of converter and discriminator)
図６は、変換器の概要について説明するための模式図である。変換器は、多値画像が入力された場合に二値画像を出力する畳み込みニューラルネットワーク（Convolutional Neural Network；ＣＮＮ）であり、入力層、隠れ層及び出力層を有する。隠れ層は、畳み込み層、プーリング層及びアンプーリング層等である。 FIG. 6 is a schematic diagram for explaining the outline of the converter. The converter is a convolutional neural network (CNN) that outputs a binary image when a multivalued image is input, and has an input layer, hidden layers, and an output layer. The hidden layers include convolution layers, pooling layers, unpooling layers, and the like.

変換器の入力層は、複数の多値画像Ｄ１を入力として受け付ける。多値画像Ｄ１は、例えば、ＲＧＢの３チャネルのそれぞれについて０～２５５の階調範囲内の階調値を有する画素からなる画像である。 The input layer of the converter receives a plurality of multivalued images D1 as input. The multivalued image D1 is, for example, an image composed of pixels having tone values within a tone range of 0 to 255 for each of the three RGB channels.

変換器の畳み込み層Ｐ１０１は、入力層に入力された複数の多値画像Ｄ１に対して、所定のサイズ及び係数を有する複数のフィルタによる畳み込み処理を実行し、特徴マップを生成する。生成される特徴マップは、多値画像Ｄ１と同一のサイズ及びフィルタの数と同数のチャネル数を有する（フィルタ数が２５６個なら２５６チャネル）。畳み込み層Ｐ１０１は、生成された特徴マップに対してバッチ正規化（Batch Normalization）処理を実行し、生成された特徴マップの特徴量がチャネルごとに所定の平均値及び分散値を有するように、各特徴量を補正する。畳み込み層Ｐ１０１は、バッチ正規化処理により補正された各特徴量に対して活性化関数（Activation Function）を適用する活性化処理を実行する。活性化関数は、例えば、ＲｅＬＵ（Rectified Linear Unit）関数である。活性化関数は、双曲線正接（Hyperbolic Tangent）関数でもよく、シグモイド（Sigmoid）関数でもよい。畳み込み層Ｐ１０１は、活性化関数を適用する前に、各特徴量に対して所定のバイアス値を加えてもよい。 The convolution layer P101 of the converter performs convolution processing, using a plurality of filters having predetermined sizes and coefficients, on the plurality of multivalued images D1 input to the input layer, and generates a feature map. The generated feature map has the same size as the multivalued image D1 and as many channels as there are filters (256 channels if there are 256 filters). The convolution layer P101 executes batch normalization on the generated feature map, correcting each feature amount so that the feature amounts of the generated feature map have a predetermined mean and variance for each channel. The convolution layer P101 then executes an activation process that applies an activation function to each feature amount corrected by the batch normalization. The activation function is, for example, a ReLU (Rectified Linear Unit) function. The activation function may instead be a hyperbolic tangent function or a sigmoid function. The convolution layer P101 may add a predetermined bias value to each feature amount before applying the activation function.
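
The batch normalization and activation steps described for the convolution layer P101 can be sketched as follows. This is a minimal illustration for the feature amounts of a single channel, not the implementation of the embodiment; the names `batch_normalize`, `relu`, `gamma`, `beta`, and `eps` are assumptions introduced here, with `beta` playing the role of the predetermined mean and `gamma` squared approximating the predetermined variance.

```python
import math

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Shift and scale one channel so its mean is beta and its variance is about gamma**2."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def relu(values):
    """Apply the ReLU activation function element-wise."""
    return [max(0.0, v) for v in values]

channel = [2.0, 4.0, 6.0, 8.0]          # feature amounts of one channel
normalized = batch_normalize(channel)   # mean close to 0, variance close to 1
activated = relu(normalized)            # negative values are clipped to 0
```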

プーリング層Ｐ１０２は、畳み込み層Ｐ１０１の出力データである特徴マップに対してプーリング（Pooling）処理を実行する。プーリング処理は、特徴マップのサイズを減少させる処理であり、例えば、特徴マップ内の所定のサイズ（例えば、２×２）の領域に含まれる特徴量のうち最大の特徴量を抽出する最大値プーリング（Max Pooling）処理である。プーリング処理は、平均値プーリング（Average Pooling）処理でもよい。プーリング層Ｐ１０２は、プーリング処理により生成された特徴マップを出力する。プーリング層Ｐ１０２の出力データである特徴マップのサイズは、プーリング層Ｐ１０２の入力データである特徴マップのサイズより小さく、例えば、縦方向、横方向のそれぞれについて入力データのサイズの２分の１である。 The pooling layer P102 performs pooling processing on the feature map that is the output data of the convolution layer P101. Pooling is a process that reduces the size of a feature map; it is, for example, max pooling, which extracts the largest feature amount among those contained in a region of a predetermined size (for example, 2 x 2) in the feature map. The pooling process may instead be average pooling. The pooling layer P102 outputs the feature map generated by the pooling process. The feature map output by the pooling layer P102 is smaller than the feature map input to it; for example, it is half the size of the input data in each of the vertical and horizontal directions.
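
As an illustration of the 2 x 2 max pooling mentioned above, the following sketch halves a single-channel feature map in each direction. The function name `max_pool_2x2` and the sample values are hypothetical, and the sketch assumes the height and width are even.

```python
def max_pool_2x2(feature_map):
    """Return a 2x2 max-pooled copy of a 2-D list with even height and width."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Keep only the largest feature amount in each 2x2 window.
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 1, 1]]
pooled = max_pool_2x2(fmap)  # 4x4 input becomes a 2x2 output
```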

畳み込み層Ｐ１０３は、プーリング層Ｐ１０２の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。プーリング層Ｐ１０４は、畳み込み層Ｐ１０３の出力データに対してプーリング処理を実行する。プーリング層Ｐ１０４の出力データのサイズは、例えば、縦方向、横方向のそれぞれについてプーリング層Ｐ１０４の入力データのサイズの２分の１である。 The convolution layer P103 performs convolution processing, batch normalization processing, and activation processing on the output data of the pooling layer P102. The pooling layer P104 performs a pooling process on the output data of the convolutional layer P103. The size of the output data of the pooling layer P104 is, for example, half the size of the input data of the pooling layer P104 in both the vertical and horizontal directions.

畳み込み層Ｐ１０５は、プーリング層Ｐ１０４の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。アンプーリング層Ｐ１０６は、畳み込み層Ｐ１０５の出力データに対してアンプーリング（Unpooling）処理を実行する。アンプーリング処理は、特徴マップのサイズを増大させるアップサンプリング処理である。アンプーリング層Ｐ１０６の出力データのサイズは、アンプーリング層Ｐ１０６の入力データのサイズより大きく、例えば、縦方向、横方向のそれぞれについて入力データのサイズの２倍である。 The convolution layer P105 performs convolution processing, batch normalization processing, and activation processing on the output data of the pooling layer P104. The unpooling layer P106 performs unpooling processing on the output data of the convolutional layer P105. The unpooling process is an upsampling process that increases the size of the feature map. The size of the output data of the unpooling layer P106 is larger than the size of the input data of the unpooling layer P106, for example, twice the size of the input data in each of the vertical and horizontal directions.
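
The description does not state which upsampling method the unpooling layers use. As one possible reading, the sketch below doubles the size of a single-channel feature map by nearest-neighbor repetition; the choice of method and the name `unpool_2x` are assumptions made only for illustration.

```python
def unpool_2x(feature_map):
    """Double the height and width of a 2-D list by repeating each value (nearest neighbor)."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the widened row vertically
    return out

upsampled = unpool_2x([[1, 2],
                       [3, 4]])
```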

畳み込み層Ｐ１０７は、アンプーリング層Ｐ１０６の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。加算層Ｐ１０８は、畳み込み層Ｐ１０７の出力データと畳み込み層Ｐ１０３の出力データとを加算する。加算層Ｐ１０８を設けることにより、後述する誤差逆伝播法の適用時において算出される勾配の絶対値が大きくなり、学習速度が向上される。アンプーリング層Ｐ１０９は、加算層Ｐ１０８の出力データに対してアンプーリング処理を実行する。アンプーリング層Ｐ１０９の出力データのサイズは、例えば、縦方向、横方向のそれぞれについてアンプーリング層Ｐ１０９の入力データのサイズの２倍である。 The convolution layer P107 performs convolution processing, batch normalization processing, and activation processing on the output data of the unpooling layer P106. The addition layer P108 adds the output data of the convolution layer P107 and the output data of the convolution layer P103. By providing the addition layer P108, the absolute value of the gradient calculated when applying the error backpropagation method described later becomes large, and the learning speed is improved. The unpooling layer P109 performs unpooling processing on the output data of the addition layer P108. The size of the output data of the unpooling layer P109 is, for example, twice the size of the input data of the unpooling layer P109 in both the vertical and horizontal directions.

畳み込み層Ｐ１１０は、アンプーリング層Ｐ１０９の出力データに対して畳み込み処理、バッチ正規化処理及び活性化処理を実行する。加算層Ｐ１１１は、畳み込み層Ｐ１１０の出力データと畳み込み層Ｐ１０１の出力データとを加算する。 The convolution layer P110 performs convolution processing, batch normalization processing, and activation processing on the output data of the unpooling layer P109. The addition layer P111 adds the output data of the convolution layer P110 and the output data of the convolution layer P101.
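
The addition layers P108 and P111 add feature maps element by element, so the two inputs of each must have the same size. The following sketch traces the spatial size through layers P101 to P110 under the example factors given above (halving in each pooling layer, doubling in each unpooling layer). It assumes that every convolution preserves the size and that halving rounds down; both are assumptions, as are the function names.

```python
def trace_converter_sizes(height, width):
    """Return the (height, width) of each layer's output, keyed by layer name."""
    sizes = {}
    h, w = height, width
    sizes["P101"] = (h, w)                   # convolution keeps the input size
    h, w = h // 2, w // 2
    sizes["P102"] = sizes["P103"] = (h, w)   # pooling, then convolution
    h, w = h // 2, w // 2
    sizes["P104"] = sizes["P105"] = (h, w)
    h, w = h * 2, w * 2
    sizes["P106"] = sizes["P107"] = (h, w)   # unpooling, then convolution
    h, w = h * 2, w * 2
    sizes["P109"] = sizes["P110"] = (h, w)
    return sizes

def skip_connections_match(sizes):
    """Check that both inputs of P108 and of P111 have identical sizes."""
    return sizes["P107"] == sizes["P103"] and sizes["P110"] == sizes["P101"]

sizes = trace_converter_sizes(480, 640)
```

Under these assumptions the sizes match whenever the height and width are multiples of 4; other sizes would need padding or cropping, which the description does not discuss.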

変換層Ｐ１１２は、加算層Ｐ１１１の出力データに対してチャネル変換処理を実行する。変換層Ｐ１１２は、各画素についての複数チャネルの特徴量に基づいて、１チャネルの特徴マップを生成して出力する。例えば、加算層Ｐ１１１の出力データがＮチャネルの特徴マップであるとすると、変換層Ｐ１１２は、加算層Ｐ１１１の出力データをＮチャネルのフィルタ１個だけで畳み込んで１チャネルの特徴マップを生成する。これにより、変換層Ｐ１１２は、特徴マップのデータ容量を削減する。 The conversion layer P112 performs channel conversion processing on the output data of the addition layer P111. The conversion layer P112 generates and outputs a one-channel feature map based on the feature amounts of the multiple channels at each pixel. For example, if the output data of the addition layer P111 is an N-channel feature map, the conversion layer P112 convolves that output data with a single N-channel filter to generate a one-channel feature map. The conversion layer P112 thereby reduces the data volume of the feature map.
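
The channel conversion can be pictured as a weighted sum over channels at every pixel. The sketch below assumes the single N-channel filter has a spatial size of 1 x 1, which the description does not specify; the name `channels_to_one` and the weights are hypothetical.

```python
def channels_to_one(feature_map, weights):
    """Collapse feature_map[c][y][x] (N channels) to one channel with a 1x1, N-channel filter."""
    n = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[sum(weights[c] * feature_map[c][y][x] for c in range(n))
             for x in range(width)]
            for y in range(height)]

two_channels = [[[1, 2], [3, 4]],         # channel 0
                [[10, 20], [30, 40]]]     # channel 1
single = channels_to_one(two_channels, weights=[1.0, 0.5])
```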

活性層Ｐ１１３は、変換層Ｐ１１２の出力データに対して活性化関数を適用する活性化処理を実行する。活性化関数は、例えば、シグモイド関数である。活性化層Ｐ１１３は、活性化関数を適用する前に、各特徴量に対して所定のバイアス値を加えてもよい。 The activation layer P113 executes an activation process that applies an activation function to the output data of the conversion layer P112. The activation function is, for example, a sigmoid function. The activation layer P113 may add a predetermined bias value to each feature amount before applying the activation function.

閾値処理層Ｐ１１４は、活性層Ｐ１１３の出力データに対して所定の閾値を有する階段関数を適用する閾値処理を実行する。階段関数は、活性層Ｐ１１３の出力である特徴マップに含まれる特徴量が閾値以上であればその特徴量を１に変換し、閾値未満であればその特徴量を０に変換する関数である。これにより、閾値処理層Ｐ１１４は、各画素に対応する特徴量が０又は１である特徴マップを出力する。 The threshold processing layer P114 performs threshold processing that applies a step function having a predetermined threshold to the output data of the activation layer P113. The step function converts each feature amount in the feature map output by the activation layer P113 to 1 if it is equal to or greater than the threshold, and to 0 if it is less than the threshold. The threshold processing layer P114 thereby outputs a feature map in which the feature amount corresponding to each pixel is 0 or 1.
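
The combination of the sigmoid activation (P113) and the step function (P114) can be sketched as below. The threshold value 0.5 and the names `sigmoid` and `binarize` are assumptions for illustration; the description only says that the threshold is predetermined.

```python
import math

def sigmoid(x):
    """Sigmoid activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(feature_map, threshold=0.5):
    """Apply the sigmoid, then a step function: 1 if the value is >= threshold, else 0."""
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in feature_map]

binary = binarize([[-2.0, 0.0],
                   [3.0, -0.1]])
```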

変換器の出力層は、閾値処理層Ｐ１１４の出力である特徴マップの特徴量を各画素の階調値とする二値画像Ｄ２を出力する。二値画像Ｄ２は、多値画像Ｄ１と同一のサイズを有し、各画素の階調値が０又は１である画像である。このようにして、変換器は、多値画像Ｄ１が入力された場合に二値画像Ｄ２を出力する。 The output layer of the converter outputs a binary image D2 in which the feature amount of the feature map output from the threshold processing layer P114 is used as the tone value of each pixel. The binary image D2 has the same size as the multivalued image D1, and the tone value of each pixel is 0 or 1. In this way, the converter outputs the binary image D2 when the multivalued image D1 is input.

なお、閾値処理層Ｐ１１４は、学習時、階段関数を適用する前に、活性層Ｐ１１３の出力である特徴マップにノイズを重畳してもよい（認識時は重畳しない）。例えば、閾値処理層Ｐ１１４は、特徴マップの各特徴量に、所定の分散値を有する、正規分布等の分布に基づいて生成された乱数を加算する。これにより、変換器は、活性層Ｐ１１３の出力の全ての特徴量が閾値未満、又は全ての特徴量が閾値以上である場合でも、二値画像Ｄ２の全ての画素の階調値が０又は１の何れかのみとなる確率を低減させる。二値画像Ｄ２の全ての画素の階調値が０又は１となってしまった場合、後述する認識器にその二値画像Ｄ２が入力されたとしても学習が行えなくなるため、学習速度が低下する。変換器は、そのような二値画像Ｄ２を出力する可能性を低減させることにより、学習速度を向上させることができる。 Note that, during learning, the threshold processing layer P114 may superimpose noise on the feature map output by the activation layer P113 before applying the step function (no noise is superimposed during recognition). For example, the threshold processing layer P114 adds to each feature amount of the feature map a random number generated from a distribution, such as a normal distribution, having a predetermined variance. As a result, even when all feature amounts output by the activation layer P113 are below the threshold, or all are at or above it, the converter reduces the probability that every pixel of the binary image D2 takes the same tone value (all 0 or all 1). If all pixels of the binary image D2 end up with the same tone value, learning cannot proceed even when that binary image D2 is input to the recognizer described later, so the learning speed decreases. By reducing the likelihood of outputting such a binary image D2, the converter can improve the learning speed.

また、この場合において、閾値処理層Ｐ１１４は、特徴マップの特徴量に応じた大きさのノイズを重畳してもよい。例えば、閾値処理層Ｐ１１４は、各特徴量について、各特徴量に乱数を加算した場合に閾値との関係が変化する確率が所定確率（例えば、１０００分の１）となる乱数の分布を決定する。閾値との関係が変化するとは、閾値未満である特徴量に乱数を加算した場合に閾値以上となること、又は、閾値以上である特徴量に乱数を加算した場合に閾値未満となることである。閾値処理層Ｐ１１４は、各特徴量について決定された分布に基づいて乱数をそれぞれ生成し、生成された乱数を各特徴量に加算する。これにより、変換器は、二値画像Ｄ２の全ての画素の階調値が０又は１となる確率を低下させつつ、ノイズによって多値画像Ｄ１との相関がない二値画像Ｄ２が出力される確率を低減させることができる。 In this case, the threshold processing layer P114 may superimpose noise whose magnitude depends on the feature amounts of the feature map. For example, for each feature amount, the threshold processing layer P114 determines a distribution of random numbers such that the probability that adding a random number to the feature amount changes its relationship to the threshold equals a predetermined probability (for example, 1/1000). The relationship to the threshold changes when a feature amount below the threshold becomes equal to or greater than the threshold after the random number is added, or when a feature amount at or above the threshold falls below it after the random number is added. The threshold processing layer P114 generates a random number from the distribution determined for each feature amount and adds it to that feature amount. The converter thereby lowers the probability that all pixels of the binary image D2 take tone value 0 or all take 1, while also reducing the probability that the noise causes a binary image D2 uncorrelated with the multivalued image D1 to be output.
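
One way to choose a per-feature noise distribution with a fixed flip probability is sketched below, assuming zero-mean Gaussian noise. For a feature amount at distance d from the threshold, the noise crosses the threshold with probability p when the standard deviation is d divided by the standard normal quantile at 1 - p. The Gaussian choice, the threshold 0.5, and the function names are assumptions; a feature amount exactly on the threshold receives no noise in this sketch.

```python
import random
from statistics import NormalDist

def noise_sigma(value, threshold, flip_prob=0.001):
    """Standard deviation for which zero-mean Gaussian noise flips `value` across
    `threshold` with probability `flip_prob`."""
    z = NormalDist().inv_cdf(1.0 - flip_prob)   # about 3.09 for flip_prob = 1/1000
    return abs(value - threshold) / z

def add_training_noise(feature_map, threshold=0.5, flip_prob=0.001, rng=random):
    """Add per-feature Gaussian noise (training only) before the step function is applied."""
    return [[v + rng.gauss(0.0, noise_sigma(v, threshold, flip_prob)) for v in row]
            for row in feature_map]

sigma = noise_sigma(0.8, 0.5)
noisy = add_training_noise([[0.1, 0.8], [0.5, 0.95]], rng=random.Random(0))
```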

また、閾値処理層Ｐ１１４は、各特徴量の平均値、中央値等の統計値に基づいて一つの分布を決定し、決定された一つの分布に基づいて生成された乱数を各特徴量に加算してもよい。これにより、変換器は、少ない計算負荷でノイズを重畳することができる。 The threshold processing layer P114 may also determine a single distribution based on a statistic of the feature amounts, such as their mean or median, and add random numbers generated from that single distribution to each feature amount. This allows the converter to superimpose noise with a lower computational load.

なお、変換器において、加算層Ｐ１０８及びＰ１１１は設けられなくてもよい。 Note that the addition layers P108 and P111 may not be provided in the converter.

図７は、認識器の概要について説明するための模式図である。認識器は、二値画像が入力された場合に対象の領域及び対象の種別を出力するＣＮＮであり、例えば、ＳＳＤ（Single Shot Multibox Detector）である。対象の領域は、入力された二値画像において対象の像に外接する矩形領域を示す情報である。対象の種別は、矩形領域に含まれる対象が、あらかじめ設定された複数の対象の種別の何れに該当するかを示す情報である。対象の種別は、例えば、「人」、「車両」又は「椅子」等である。対象の種別は、「人の上半身」等でもよい。なお、認識すべき対象の種別が一種類（例えば、「人」のみ）である場合、認識器は、対象の種別を出力しなくてもよい。 FIG. 7 is a schematic diagram for explaining the outline of the recognizer. The recognizer is a CNN that outputs the target area and target type when a binary image is input, and is, for example, an SSD (Single Shot Multibox Detector). The target area is information indicating a rectangular area circumscribing the target image in the input binary image. The target type is information indicating which of a plurality of preset target types the target included in the rectangular area corresponds to. The type of target is, for example, "person", "vehicle", or "chair". The type of target may be "upper body of a person" or the like. Note that if the type of target to be recognized is one type (for example, only "person"), the recognizer does not need to output the type of target.

認識器の入力層は、二値画像Ｄ３を入力として受け付ける。二値画像Ｄ３は、変換器から出力された二値画像Ｄ２である。 The input layer of the recognizer receives the binary image D3 as input. Binary image D3 is binary image D2 output from the converter.

ベースネットワーク（Base Network）Ｐ２０１は、複数の畳み込み層及び全結合層を有するＣＮＮである。ベースネットワークＰ２０１は、画像分類のために用いられる任意のＣＮＮであってよく、例えば、ＶＧＧ－１６等である。ベースネットワークＰ２０１は、二値画像Ｄ３を入力された場合に、特徴マップを出力する。 The base network P201 is a CNN having multiple convolutional layers and fully connected layers. The base network P201 may be any CNN used for image classification, such as VGG-16. The base network P201 outputs a feature map when receiving the binary image D3.

特徴層Ｐ２０２は、ベースネットワークＰ２０１の出力データを入力として受け付ける。特徴層Ｐ２０２は、入力された特徴マップに畳み込み処理を実行し、入力データよりも小さいサイズの特徴マップを出力する。また、特徴層Ｐ２０２は、出力される特徴マップの各画素の特徴量から推定される矩形領域を示す領域情報を出力するとともに、複数の対象の種別のそれぞれについて、その矩形領域に各種別の対象が含まれる可能性を示す信頼度情報を出力する。領域情報は、例えば、矩形領域の中心座標並びに矩形領域の幅及び高さの情報である。信頼度情報は、例えば、対象の各種別に対応する、０以上１以下の値で示される複数の変数からなるベクトルであり、各変数は、その値が１に近いほど対応する種別の対象が含まれる可能性が高いことを示す。 The feature layer P202 receives the output data of the base network P201 as input. The feature layer P202 performs convolution processing on the input feature map and outputs a feature map smaller in size than the input data. The feature layer P202 also outputs region information indicating the rectangular region estimated from the feature amount of each pixel of the output feature map and, for each of the plurality of target types, reliability information indicating the likelihood that the rectangular region contains a target of that type. The region information is, for example, the center coordinates, width, and height of the rectangular region. The reliability information is, for example, a vector of variables, one per target type, each taking a value between 0 and 1; the closer a variable's value is to 1, the more likely it is that the rectangular region contains a target of the corresponding type.

特徴層Ｐ２０３は、特徴層Ｐ２０２の出力データである特徴マップを入力として受け付ける。特徴層Ｐ２０３は、特徴層Ｐ２０２と同様に、畳み込み処理を実行し、入力データよりも小さいサイズの特徴マップ、並びに、その特徴マップについての領域情報及び信頼度情報を出力する。 The feature layer P203 receives as input the feature map that is the output data of the feature layer P202. Similar to the feature layer P202, the feature layer P203 executes convolution processing and outputs a feature map smaller in size than the input data, as well as area information and reliability information about the feature map.

特徴層Ｐ２０３の次に、さらに任意の数の特徴層が設けられてもよい。 After the feature layer P203, any number of feature layers may be provided.

後処理部Ｐ２０４は、各特徴層から出力された領域情報と信頼度情報とを入力として受け付ける。後処理部Ｐ２０４は、入力された信頼度情報に基づいて、各領域情報に示される矩形領域に何れかの種別の対象が含まれるか否か、及び、含まれる場合には何れの種別の対象が含まれるかを判定する。判定は、例えば、信頼度情報に含まれる各変数の値が所定値以上であるか否か、及び、所定値以上である変数が複数である場合には、何れの変数の値が最も大きいかに基づいて行われる。後処理部Ｐ２０４は、同一の種別の対象が含まれると判定され、且つ、領域が所定比率以上重複している複数の矩形領域を統合する。矩形領域の統合には、例えば、Non-Maximum Suppression等の方法が用いられる。これにより、一の対象に対して一の矩形領域が生成される。後処理部Ｐ２０４は、出力層を介して、生成された矩形領域の領域情報を対象の領域Ｄ４として出力するとともに、その矩形領域に対応する信頼度情報を対象の種別Ｄ５として出力する。 The post-processing unit P204 receives as input the area information and reliability information output from each feature layer. Based on the input reliability information, the post-processing unit P204 determines whether any type of target is included in the rectangular area indicated by each area information, and if so, which type of target is included. Determine whether it is included. The determination is, for example, whether the value of each variable included in the reliability information is greater than or equal to a predetermined value, and if there are multiple variables that are greater than or equal to the predetermined value, which variable has the largest value. It is carried out based on. The post-processing unit P204 integrates a plurality of rectangular areas that are determined to include objects of the same type and whose areas overlap by a predetermined ratio or more. For example, a method such as Non-Maximum Suppression is used to integrate the rectangular areas. As a result, one rectangular area is generated for one object. The post-processing unit P204 outputs the area information of the generated rectangular area as the target area D4 via the output layer, and also outputs the reliability information corresponding to the rectangular area as the target type D5.
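
The integration of overlapping rectangles by Non-Maximum Suppression can be sketched as follows. Boxes use the (center x, center y, width, height) form described above; the overlap measure (intersection over union), the threshold 0.5, and the function names are assumptions for illustration. Among overlapping boxes of the same type, this sketch keeps the one with the highest reliability value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, width, height)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, label, score). Returns the detections that survive."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, label, _ = det
        # Drop a box only if a higher-scoring kept box of the same type overlaps it enough.
        if all(label != k[1] or iou(box, k[0]) < iou_threshold for k in kept):
            kept.append(det)
    return kept

detections = [((50, 50, 20, 20), "person", 0.9),
              ((52, 50, 20, 20), "person", 0.8),    # overlaps the first box
              ((150, 50, 20, 20), "person", 0.7),   # far away from the others
              ((51, 50, 20, 20), "vehicle", 0.6)]   # overlaps, but a different type
kept = non_max_suppression(detections)
```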

（各種データのデータ構造） (Data structure of various data)
図８は、学習装置２の第１記憶部２１に記憶される学習用データ２１１のデータ構造の一例を示す図である。学習用データ２１１は、データＩＤと、学習用多値画像と、学習用認識結果とが関連付けられたデータである。なお、学習用多値画像は、学習用第１画像の一例である。 FIG. 8 is a diagram showing an example of the data structure of the learning data 211 stored in the first storage unit 21 of the learning device 2. The learning data 211 is data in which a data ID, a learning multivalued image, and a learning recognition result are associated with one another. Note that the learning multivalued image is an example of the first image for learning.

データＩＤは、学習用多値画像と学習用認識結果との組み合わせを識別するための識別情報である。学習用多値画像には、画像を構成する各画素の階調値の情報が含まれる。図８に示す例では、各画素について、ＲＧＢの３チャネルのそれぞれについて０～２５５の階調値が記憶されている。学習用認識結果は、学習用多値画像に対して出力されるべきものとして予め設定された認識結果であり、対象の領域と対象の種別とを含む。対象の領域は、学習用多値画像において対象の像に外接する矩形領域を示す情報であり、例えば、矩形領域の中心座標並びに矩形領域の幅及び高さの情報である。対象の種別の情報は、対象の領域によって示される矩形領域に含まれる対象が、あらかじめ設定された複数の対象の種別の何れに該当するかを示す情報である。対象の種別は、例えば、該当する種別に対応する変数の値が１で、他の種別に対応する変数の値が０である、所謂one-hotベクトルである。なお、認識すべき対象の種別が一種類である場合、学習用認識結果は、対象の種別を含まなくてもよい。また、学習用多値画像に複数の対象が含まれる場合、各対象に対応する複数の対象の領域及び対象の種別の情報が含まれてもよい。 The data ID is identification information for identifying the combination of the learning multivalued image and the learning recognition result. The learning multivalued image includes information on the tone value of each pixel making up the image. In the example shown in FIG. 8, tone values from 0 to 255 are stored for each of the three RGB channels of each pixel. The learning recognition result is a recognition result set in advance as the one that should be output for the learning multivalued image, and includes a target area and a target type. The target area is information indicating a rectangular area circumscribing the target image in the learning multivalued image, for example the center coordinates, width, and height of the rectangular area. The target type information indicates which of a plurality of preset target types the target contained in the rectangular area indicated by the target area corresponds to. The target type is, for example, a so-called one-hot vector in which the variable corresponding to the relevant type is 1 and the variables corresponding to the other types are 0. Note that when there is only one type of target to be recognized, the learning recognition result does not need to include the target type. Further, when a learning multivalued image includes a plurality of targets, it may include target area and target type information for each of those targets.
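
A record of the learning data 211 could be represented as in the sketch below. The class and field names are hypothetical and are not taken from FIG. 8; only the contents (data ID, RGB tone values, and per-target area and one-hot type) follow the description.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TargetAnnotation:
    center: Tuple[float, float]   # center coordinates of the circumscribing rectangle
    width: float
    height: float
    class_one_hot: List[int]      # 1 for the matching type, 0 for all others

@dataclass
class TrainingRecord:
    data_id: int
    image: List[List[Tuple[int, int, int]]]   # rows of (R, G, B) tone values, each 0 to 255
    targets: List[TargetAnnotation] = field(default_factory=list)

def one_hot(index, num_classes):
    """Build a one-hot vector with a 1 at `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]

record = TrainingRecord(
    data_id=1,
    image=[[(0, 0, 0), (255, 255, 255)]],
    targets=[TargetAnnotation(center=(0.5, 0.0), width=1.0, height=1.0,
                              class_one_hot=one_hot(0, 3))],
)
```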

学習用データ２１１は、あらかじめ学習装置２の管理者によって設定され、第１記憶部２１に記憶される。 The learning data 211 is set in advance by the administrator of the learning device 2 and stored in the first storage unit 21.

（処理の流れ） (Processing flow)
図９は、学習装置２によって実行される学習処理の流れの一例を示すフロー図である。学習処理は、第１記憶部２１に記憶されたプログラムに従って、第１処理部２３が学習装置２の各構成要素と協働することにより実現される。 FIG. 9 is a flow diagram showing an example of the flow of the learning processing executed by the learning device 2. The learning processing is realized by the first processing unit 23 cooperating with each component of the learning device 2 according to a program stored in the first storage unit 21.

まず、学習用モデル取得手段２３１は、第１記憶部２１から学習用モデルを取得する（Ｓ１０１）。学習用モデルは、変換器の出力が認識器の入力となるように結合されたＣＮＮである。学習用モデル取得手段２３１は、取得された学習用モデルに含まれるフィルタの係数等のパラメータを、乱数等により初期化してもよい。 First, the learning model acquisition unit 231 acquires a learning model from the first storage unit 21 (S101). The learning model is a CNN that is coupled such that the output of the converter becomes the input of the recognizer. The learning model acquisition means 231 may initialize parameters such as filter coefficients included in the acquired learning model using random numbers or the like.

続いて、学習用データ取得手段２３２は、第１記憶部２１から学習用データ２１１を取得する（Ｓ１０２）。 Subsequently, the learning data acquisition means 232 acquires the learning data 211 from the first storage unit 21 (S102).

続いて、エッジ画像生成手段２３３は、学習用データ２１１に含まれる学習用多値画像からエッジ画像を生成する（Ｓ１０３）。エッジ画像は、エッジ画素の階調値と他の画素の階調値とが互いに異なる二値画像である。エッジ画像生成手段２３３は、学習用多値画像に対してＣａｎｎｙのエッジ検出方法を適用し、学習用多値画像からエッジ画素を検出する。エッジ画像生成手段２３３は、学習用多値画像において、検出されたエッジ画素の階調値を１に、他の画素の階調値を０に設定した画像をエッジ画像として生成する。 Subsequently, the edge image generation unit 233 generates an edge image from the learning multivalued image included in the learning data 211 (S103). An edge image is a binary image in which the tone values of edge pixels and the tone values of other pixels are different from each other. The edge image generation unit 233 applies Canny's edge detection method to the learning multi-value image to detect edge pixels from the learning multi-value image. The edge image generation unit 233 generates an image in which the tone value of the detected edge pixel is set to 1 and the tone values of other pixels are set to 0 in the learning multivalued image as an edge image.

なお、エッジ画像生成手段２３３は、ソーベルフィルタ等の公知のエッジ検出フィルタを用いてエッジ画像を生成してもよい。 Note that the edge image generation unit 233 may generate the edge image using a known edge detection filter such as a Sobel filter.
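
As an illustration of the Sobel-based alternative, the sketch below builds a binary edge image from a single-channel (grayscale) image. Converting the RGB image to grayscale, the magnitude threshold, and leaving border pixels at 0 are assumptions made here; the name `sobel_edge_image` is hypothetical. It does not reproduce the Canny method.

```python
def sobel_edge_image(gray, threshold):
    """Return a binary image: 1 where the Sobel gradient magnitude is >= threshold, else 0."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical Sobel responses over the 3x3 neighborhood.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges

# A 4x6 image whose left half is dark and right half is bright.
gray = [[0, 0, 0, 255, 255, 255] for _ in range(4)]
edge_image = sobel_edge_image(gray, threshold=100)
```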

続いて、学習手段２３４は、学習用モデルに学習用多値画像を入力することにより、認識結果を生成する（Ｓ１０４）。認識結果は、学習用モデルから出力された対象物の領域及び対象物の種別である。認識結果は、学習用モデルのうちの変換器から出力された二値画像を含んでもよい。 Subsequently, the learning unit 234 generates a recognition result by inputting the learning multivalued image to the learning model (S104). The recognition result is the region of the object and the type of the object output from the learning model. The recognition result may include a binary image output from a converter of the learning model.

なお、学習手段２３４は、学習用モデルに、学習用多値画像にノイズを付加した画像を入力してもよい。これにより、学習装置２は、入力される多値画像にノイズが含まれていても適切に認識結果が出力されるように学習用モデルを学習させることができる。ただしこの場合、エッジ画像生成手段２３３は、ノイズを付加する前の学習用多値画像からエッジ画像を生成するのが良い。 Note that the learning unit 234 may input, to the learning model, an image obtained by adding noise to a multivalued learning image. Thereby, the learning device 2 can train the learning model to appropriately output recognition results even if the input multivalued image contains noise. However, in this case, it is preferable that the edge image generation unit 233 generates the edge image from the learning multivalued image before adding noise.

続いて、学習手段２３４は、生成された認識結果と学習用認識結果とに基づいて、誤差を算出する（Ｓ１０５）。誤差は、変換器の学習に用いられる、生成された認識結果と学習用認識結果との間の差の程度を示す指標であり、対象の領域に関する誤差と、対象の種別に関する誤差との重み付け和である誤差関数により算出される。対象の領域に関する誤差は、例えば、生成された認識結果の矩形領域と、学習用認識結果の矩形領域との間の中心座標、幅及び高さの二乗誤差又は対数二乗誤差等である。対象の種別に関する誤差は、例えば、生成された認識結果の対象の種別と、学習用認識結果の対象の種別との間の交差エントロピー誤差である。 Subsequently, the learning means 234 calculates an error based on the generated recognition result and the learning recognition result (S105). The error is an index that indicates the degree of difference between the generated recognition result and the training recognition result used for learning the converter, and is the weighted sum of the error related to the target area and the error related to the target type. It is calculated by the error function. The error regarding the target area is, for example, a square error or a logarithmic square error of the center coordinate, width, and height between the rectangular area of the generated recognition result and the rectangular area of the learning recognition result. The error related to the target type is, for example, a cross-entropy error between the target type of the generated recognition result and the target type of the learning recognition result.
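
The weighted sum described above can be sketched for a single target as follows. The weights, the use of the plain squared error for the rectangle, and the function names are assumptions; matching predicted rectangles to annotated ones, which a detector such as SSD also requires, is outside this sketch.

```python
import math

def box_squared_error(pred_box, target_box):
    """Squared error over (center x, center y, width, height)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, target_box))

def cross_entropy(pred_probs, target_one_hot, eps=1e-12):
    """Cross-entropy between predicted reliabilities and a one-hot target type."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(pred_probs, target_one_hot))

def detection_error(pred_box, target_box, pred_probs, target_one_hot,
                    box_weight=1.0, class_weight=1.0):
    """Weighted sum of the area error and the type error."""
    return (box_weight * box_squared_error(pred_box, target_box)
            + class_weight * cross_entropy(pred_probs, target_one_hot))

error = detection_error((11, 10, 4, 4), (10, 10, 4, 4), [0.5, 0.5], [1, 0])
```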

誤差関数には、さらに二値画像に関する誤差が含まれてもよい。二値画像に関する誤差は、認識結果に含まれる、学習用モデルの変換器から出力された二値画像と、学習用多値画像から生成されたエッジ画像との二乗誤差である。これにより、学習装置２は、変換器によって出力される二値画像をエッジ画像に近づけ、且つ、学習用モデルによって出力される認識結果を学習用認識結果に近づけるように学習用モデルを学習させる。学習装置２は、変換器によって出力される二値画像をエッジ画像に近づけることにより、画像認識システム１のユーザが二値画像における対象の像を視認しやすくする。 The error function may further include errors regarding the binary image. The error regarding the binary image is a squared error between the binary image output from the learning model converter and the edge image generated from the learning multivalued image, which is included in the recognition result. Thereby, the learning device 2 causes the learning model to learn so that the binary image output by the converter approaches the edge image, and the recognition result output by the learning model approaches the learning recognition result. The learning device 2 makes it easier for the user of the image recognition system 1 to visually recognize the target image in the binary image by bringing the binary image output by the converter closer to the edge image.

二値画像に関する誤差は、二値画像と、エッジ画像をぼかした画像との二乗誤差でもよい。エッジ画像をぼかした画像は、エッジ画像に所定のフィルタ（例えば、ガウシアンフィルタ）を適用した画像である。また、二値画像に関する誤差は、二値画像のヒストグラムと、エッジ画像のヒストグラムとの二乗誤差でもよい。ヒストグラムは、例えば、各画像を所定のサイズの領域に区分した場合に、各領域に含まれる階調値が０である画素（又は、１である画素）の数を階級とし、各階級に対応する領域の数を度数とする度数分布である。ヒストグラムは、各画像における階調値の勾配の頻度を示すＨＯＧ（Histogram of Oriented Gradients）でもよい。 The error regarding the binary image may be the squared error between the binary image and a blurred version of the edge image. The blurred edge image is obtained by applying a predetermined filter (for example, a Gaussian filter) to the edge image. The error regarding the binary image may also be the squared error between the histogram of the binary image and the histogram of the edge image. The histogram is, for example, a frequency distribution obtained by dividing each image into regions of a predetermined size, in which the class is the number of pixels with a tone value of 0 (or of 1) contained in a region and the frequency is the number of regions belonging to each class. The histogram may instead be a HOG (Histogram of Oriented Gradients), which indicates the frequency of tone-value gradients in each image.
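
The region-count histogram can be sketched as follows, counting pixels with tone value 1 in non-overlapping square regions. The region size of 2 x 2 and the function names are assumptions for illustration.

```python
def block_count_histogram(binary, block):
    """hist[k] = number of block x block regions that contain exactly k pixels with value 1."""
    height, width = len(binary), len(binary[0])
    hist = [0] * (block * block + 1)
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            ones = sum(binary[y][x]
                       for y in range(by, by + block)
                       for x in range(bx, bx + block))
            hist[ones] += 1
    return hist

def histogram_squared_error(image_a, image_b, block=2):
    """Squared error between the region-count histograms of two binary images."""
    hist_a = block_count_histogram(image_a, block)
    hist_b = block_count_histogram(image_b, block)
    return sum((a - b) ** 2 for a, b in zip(hist_a, hist_b))

edge = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
shifted = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]  # one pixel moved by one
blank = [[0] * 4 for _ in range(4)]
```

In this example `edge` and `shifted` give identical histograms, which illustrates why a histogram-based error is insensitive to small displacements of an edge within a region.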

二値画像とエッジ画像との間にエッジの位置や形状の微差があったとしても、そのような微差はユーザが二値画像における対象の像を視認する際には問題となりにくい。学習装置２は、エッジ画像をぼかした画像を用いることで、このようなエッジの位置や形状の微差を誤差関数に反映されにくくし、変換器の学習を容易にする。 Even if there is a slight difference in the position or shape of the edge between the binary image and the edge image, such a slight difference is unlikely to cause a problem when the user visually recognizes the target image in the binary image. By using an image in which the edge image is blurred, the learning device 2 makes it difficult for such slight differences in the position and shape of the edge to be reflected in the error function, thereby facilitating the learning of the converter.

また、畳み込み層と、畳み込み層の出力に基づく入力に対して活性化関数を適用する活性化層とが含まれる変換器の学習に用いられる誤差関数には、畳み込み層において適用されるフィルタの係数のノルムが含まれてもよい。フィルタの係数のノルムは、例えば、係数の二乗和（Ｌ２ノルム）又はフィルタのスペクトルノルムである。 In addition, the error function used for learning a transformer that includes a convolution layer and an activation layer that applies an activation function to an input based on the output of the convolution layer includes the coefficients of the filter applied in the convolution layer. The norm may be included. The norm of the coefficients of the filter is, for example, the sum of squares of the coefficients (L2 norm) or the spectral norm of the filter.

フィルタの係数のＬ２ノルムが大きい場合、変換器の畳み込み層において適用されるフィルタの係数の絶対値が大きいため、変換器の活性化層Ｐ１１３に入力される特徴マップの特徴量の絶対値も大きくなりやすい。この場合、活性化層Ｐ１１３により適用される活性化関数がシグモイド関数であれば、活性化層Ｐ１１３の出力の特徴量の多くは０に近い値又は１に近い値を有し、中間である０．５に近い値を有しない。このような特徴量を有する特徴マップが閾値処理層Ｐ１１４に入力された場合、閾値処理層Ｐ１１４から出力される画像の全ての画素の階調値が０又は１となる可能性が高くなり、認識器の学習が行われず、学習速度が低下する。 When the L2 norm of the filter coefficients is large, the absolute values of the coefficients of the filters applied in the convolution layers of the converter are large, so the absolute values of the feature amounts in the feature map input to the activation layer P113 of the converter also tend to be large. In this case, if the activation function applied by the activation layer P113 is a sigmoid function, most of the feature amounts output by the activation layer P113 are close to 0 or close to 1, and few are close to the intermediate value 0.5. When a feature map with such feature amounts is input to the threshold processing layer P114, it becomes more likely that all pixels of the image output from the threshold processing layer P114 have tone value 0 or all have 1; the recognizer then cannot be trained, and the learning speed decreases.

また、スペクトルノルムは、畳み込み層に対する入力である複数の特徴マップのＬ２ノルムに対する、各入力に対応する出力である特徴マップのＬ２ノルムの比のうち、最大のものである。スペクトルノルムが大きい場合、畳み込み層の出力データの特徴量の絶対値が大きいため、同様に、閾値処理層Ｐ１１４の出力が全ての画素の階調値が０又は１である画像となる可能性が高くなり、学習速度が低下する。 The spectral norm is the maximum, over the feature maps input to the convolution layer, of the ratio of the L2 norm of the corresponding output feature map to the L2 norm of the input feature map. When the spectral norm is large, the absolute values of the feature amounts in the output data of the convolution layer are large, so it likewise becomes more likely that the output of the threshold processing layer P114 is an image in which all pixels have tone value 0 or all have 1, and the learning speed decreases.

学習装置２は、誤差関数にＬ２ノルム又はスペクトルノルムを加えることにより、Ｌ２ノルム又はスペクトルノルムの値を小さくするようにＣＮＮを学習させる。これにより、学習装置２は、変換器から出力される二値画像の全ての画素の階調値が０又は１となる可能性を低減させ、学習速度を向上させることができる。なお、畳み込み層のフィルタのスペクトルノルムを誤差関数に加えるかわりに、スペクトルノルムが１となるように正規化したフィルタの係数を畳み込みで用いるようにしてもよい。 The learning device 2 makes the CNN learn to reduce the value of the L2 norm or the spectral norm by adding the L2 norm or the spectral norm to the error function. Thereby, the learning device 2 can reduce the possibility that the tone values of all pixels of the binary image output from the converter will be 0 or 1, and can improve the learning speed. Note that instead of adding the spectral norm of the filter of the convolutional layer to the error function, the coefficients of the filter normalized so that the spectral norm becomes 1 may be used in the convolution.
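
For a linear map written as a matrix, the spectral norm equals the largest singular value, which can be estimated by power iteration. The sketch below assumes the filter has been rearranged into a 2-D matrix, which is a common practice but is not stated in the description; the function names and the iteration count are likewise assumptions. Power iteration can fail to converge if the starting vector happens to be orthogonal to the dominant direction.

```python
import math

def spectral_norm(matrix, iterations=100):
    """Estimate the largest singular value of `matrix` (list of rows) by power iteration."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        u = [sum(matrix[r][c] * v[c] for c in range(cols)) for r in range(rows)]  # u = W v
        v = [sum(matrix[r][c] * u[r] for r in range(rows)) for c in range(cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    u = [sum(matrix[r][c] * v[c] for c in range(cols)) for r in range(rows)]
    return math.sqrt(sum(x * x for x in u))  # ratio of output norm to unit input norm

def normalize_spectral_norm(matrix):
    """Scale the coefficients so that the spectral norm becomes 1."""
    sigma = spectral_norm(matrix)
    return [[x / sigma for x in row] for row in matrix] if sigma > 0 else matrix

weights = [[3.0, 0.0],
           [0.0, 1.0]]
sigma = spectral_norm(weights)
normalized = normalize_spectral_norm(weights)
```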

The error function may also include the proportion of pixels whose gradation value is 1 (or the proportion of pixels whose gradation value is 0) among the pixels constituting the binary image output from the converter. The error function may also include the squared error in gradation value between each pixel constituting the binary image output from the converter and the pixels adjacent to it. Doing so improves the compression efficiency when the binary image output from the converter is compressed.
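
As a rough illustration of these two additional error terms, the sketch below computes them for a small binary image stored as a list of rows. The choice of 4-neighbour adjacency (right and lower neighbours, so each pair is counted once) and the function names are assumptions; the patent does not specify which neighbours are used or how the terms are weighted.

```python
def ones_ratio(image):
    """Proportion of pixels whose gradation value is 1."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

def neighbor_squared_error(image):
    """Sum of squared differences between horizontally and vertically adjacent pixels."""
    height, width = len(image), len(image[0])
    total = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:   # right neighbour
                total += (image[y][x] - image[y][x + 1]) ** 2
            if y + 1 < height:  # lower neighbour
                total += (image[y][x] - image[y + 1][x]) ** 2
    return total

# Hypothetical 3x4 binary image with one 2x2 block of ones.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
ratio = ones_ratio(image)
roughness = neighbor_squared_error(image)
```

Both terms favour images with few set pixels and long uniform runs, which are the kinds of images that lossless coders tend to compress well.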

Next, the learning means 234 updates the parameters of the CNN (S106). The learning means 234 calculates the gradient of each layer of the CNN using error backpropagation and, by a stochastic gradient method based on the calculated gradients, updates the parameters so as to reduce the error. The updated parameters are the coefficients of the filters applied in the convolution layers and the mean and variance of each feature corrected by the batch normalization processing in the convolution layers. The updated parameters may include the bias values added to each feature before the activation function is applied in the convolution layers and the activation layers. The updated parameters may also include, for example, the threshold of the step function applied in the threshold processing layer.

When applying error backpropagation to update the parameters of the converter, the learning means 234 may use the gradient of another function, different from the step function, as the gradient of the threshold processing layer P114, which is included in the converter and applies the step function to its input. The other function is a function in which the interval where the gradient is 0 is smaller than that of the step function, for example an identity function or a sigmoid function. This allows the learning device 2 to update the parameters so as to further reduce the error and to improve the learning speed. That is, in error backpropagation, the gradient of the preceding layer is calculated from the gradient of each layer, and the parameters are updated by identifying those that contribute most to the error. Therefore, when there is a layer that applies a function in which intervals of zero gradient are dominant, such as a step function, it becomes difficult to identify the parameters that cause the error in the layers preceding that layer. By using, as the gradient of the threshold processing layer P114, the gradient of another function that differs from the step function and has a smaller interval of zero gradient, the learning device 2 makes it easier to identify the parameters that cause the error.
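
This substitution of gradients is often called a straight-through estimator. The sketch below shows the idea for a single scalar: the forward pass applies the step function, while the backward pass multiplies the upstream gradient by the derivative of a surrogate function. The threshold value of 0.5, the centring of the sigmoid surrogate on the threshold, and the function names are assumptions made for illustration.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def step_forward(x, threshold=0.5):
    """Forward pass of the threshold layer: a step function."""
    return 1.0 if x >= threshold else 0.0

def surrogate_gradient(x, kind="identity", threshold=0.5):
    """Derivative of the surrogate used in place of the step function's gradient."""
    if kind == "identity":
        return 1.0
    if kind == "sigmoid":
        s = sigmoid(x - threshold)
        return s * (1.0 - s)
    raise ValueError(f"unknown surrogate: {kind}")

def step_backward(x, upstream_grad, kind="identity", threshold=0.5):
    """Backward pass: pass the upstream gradient through the surrogate's derivative."""
    return upstream_grad * surrogate_gradient(x, kind, threshold)

y = step_forward(0.2)                              # binary output of the forward pass
g_identity = step_backward(0.2, 2.0)               # identity surrogate passes the gradient unchanged
g_sigmoid = step_backward(0.5, 2.0, kind="sigmoid")  # sigmoid surrogate, evaluated at the threshold
```

The true derivative of the step function is 0 everywhere except at the threshold, so without a surrogate no gradient would reach the layers before the threshold layer.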

Next, the learning means 234 determines whether the learning termination condition is satisfied (S107). The termination condition is, for example, that the parameters have been updated a predetermined number of times or more, or that the amount of change of the updated parameters relative to the parameters before the update is equal to or less than a predetermined value.
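
A minimal sketch of such a termination check follows. The patent does not say how the "amount of change" is measured; the largest absolute per-parameter difference is used here as an assumption, and the default limits are placeholders.

```python
def should_stop(update_count, old_params, new_params,
                max_updates=1000, min_change=1e-6):
    """Return True when either example termination condition holds.

    The change measure (largest absolute difference) is an assumption.
    """
    change = max(abs(n - o) for o, n in zip(old_params, new_params))
    return update_count >= max_updates or change <= min_change

# Still changing and below the update limit: keep training.
keep_going = should_stop(10, [0.5, 1.0], [0.4, 1.1])
# Parameters have effectively stopped changing: stop.
converged = should_stop(10, [0.5, 1.0], [0.5, 1.0 + 1e-9])
```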

If it is determined that the termination condition is not satisfied (S107-No), the learning means 234 returns the processing to S102. If it is determined that the termination condition is satisfied (S107-Yes), the learning means 234 stores the CNN in the first storage unit 21 as a trained model (S108) and ends the series of processing.

In this way, the learning device 2 generates the converter and the recognizer by training them simultaneously. This allows the learning device 2 to train the converter to output binary images from which the recognizer can recognize the target with high accuracy.

FIG. 10 is a sequence diagram showing an example of the flow of the image recognition processing executed by the image recognition system 1. The image recognition processing is realized by the first processing unit 23, the second processing unit 34, and the third processing unit 43 cooperating with the components of their respective devices on the basis of the programs stored in the first storage unit 21, the second storage unit 31, and the third storage unit 41.

First, the output means 235 of the learning device 2 outputs the converter and the recognizer to the imaging device 3 and the recognition device 4 via the first communication unit 22 (S201). The output means 235 generates the converter and the recognizer by separating the CNN that is the trained model stored in the first storage unit 21. The output means 235 transmits the converter to the imaging device 3 and the recognizer to the recognition device 4 via the first communication unit 22. The imaging device 3 receives the converter and stores it in the second storage unit 31. The recognition device 4 receives the recognizer and stores it in the third storage unit 41.

Next, the imaging means 341 of the imaging device 3 controls the imaging unit 33 to image a room in the building and generate a multi-value image (S202).

Next, the converting means 342 converts the generated multi-value image into a binary image (S203). The converting means 342 performs the conversion by inputting the multi-value image to the converter stored in the second storage unit 31 and causing it to output the binary image.

Next, the binary image output means 343 outputs the binary image to the transmission network 6 via the second communication unit 32 (S204). The binary image output means 343 may apply a predetermined lossless compression technique to the binary image before outputting it. This allows the imaging device 3 to reduce the transmission capacity required for the binary image.
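
One way to picture this step is sketched below: the binary image is packed at one bit per pixel and then passed through a general-purpose lossless coder. The patent does not name a compression technique; DEFLATE via Python's `zlib` module is used here purely as a stand-in, and the function names are hypothetical.

```python
import zlib

def pack_bits(image):
    """Pack a binary image (list of rows of 0/1) into bytes, 8 pixels per byte."""
    flat = [p for row in image for p in row]
    packed = bytearray()
    for i in range(0, len(flat), 8):
        chunk = flat[i:i + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # left-align a final partial byte
        packed.append(byte)
    return bytes(packed)

def unpack_bits(data, height, width):
    """Inverse of pack_bits for an image of known size."""
    bits = []
    for byte in data:
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    bits = bits[:height * width]  # drop padding bits
    return [bits[r * width:(r + 1) * width] for r in range(height)]

def compress_binary_image(image):
    """Bit-pack, then apply DEFLATE (a stand-in for the unspecified lossless coder)."""
    return zlib.compress(pack_bits(image), 9)

def decompress_binary_image(payload, height, width):
    return unpack_bits(zlib.decompress(payload), height, width)

# Round trip on a small checkerboard whose pixel count is not a multiple of 8.
checker = [[(x + y) % 2 for x in range(5)] for y in range(3)]
restored = decompress_binary_image(compress_binary_image(checker), 3, 5)

# A uniform 64x64 image packs to 512 bytes and compresses far below that.
blank = [[0] * 64 for _ in range(64)]
blank_payload = compress_binary_image(blank)
```

Because the compression is lossless, the recognition device receives exactly the binary image that the converter produced.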

Next, the binary image acquisition means 431 of the recognition device 4 acquires the binary image from the transmission network 6 via the third communication unit 42 (S205).

Next, the recognition means 432 generates a recognition result for the binary image (S206). The recognition means 432 generates the recognition result by inputting the binary image to the recognizer stored in the third storage unit 41 and causing it to output the recognition result.

Next, the recognition result output means 433 outputs the generated recognition result to the display device 5 via the third communication unit 42 (S207) and ends the series of processing. For example, the recognition result output means 433 transmits to the display device 5 the display data that the display device 5 needs in order to display a recognition result screen 700 based on the recognition result.

FIG. 11 is a diagram showing an example of the recognition result screen 700 displayed on the display device 5. The recognition result screen 700 includes a binary image 710, an image 711 of the target, a circumscribed rectangle 720, and a type display object 721.

The binary image 710 is the binary image generated by the imaging device 3. In the example shown in FIG. 11, pixels with gradation values of 1 and 0 are shown in black and white, respectively. In the example shown in FIG. 11, the image 711 of the target is a full-body image of a person. The circumscribed rectangle 720 is a rectangular object that circumscribes the image 711 of the target and is displayed based on the region of the target generated by the recognition device 4. The type display object 721 is an object that indicates the type of the target using characters or the like and is displayed based on the type of the target generated by the recognition device 4. The type display object 721 indicates, for example, the type of target corresponding to the variable with the largest value among the variables included in the type of the target generated by the recognition device 4.
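
Selecting the type to display from the per-type variables amounts to taking the index of the largest value. A minimal sketch, in which the label names and score values are hypothetical:

```python
def type_label(scores, labels):
    """Return the label whose score is largest (first one wins on ties)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]

# Hypothetical type variables output by the recognizer for one detected region.
labels = ["person", "vehicle", "other"]
shown = type_label([0.81, 0.12, 0.07], labels)
```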

As described above, in the image recognition system 1, the learning device 2 trains a CNN in which the converter and the recognizer are coupled so that the output of the converter becomes the input of the recognizer. The imaging device 3 then converts a multi-value image into a binary image using the converter, which is a trained model, and the recognition device 4 generates a recognition result for the binary image using the recognizer, which is a trained model. In this way, the image recognition system 1 can stably reduce the data volume of the images subject to image recognition while maintaining the accuracy of image recognition.

In the above description, the input to the converter is an image having a plurality of channels, but the input to the converter may be a one-channel image (for example, a grayscale image).

In the above description, the output of the converter is a one-channel binary image, but the output is not limited to this. The converter may output a plurality of binary images, one for each channel of the input multi-value image. For example, when the input multi-value image has the three RGB channels, the converter generates a binary image corresponding to the R channel, a binary image corresponding to the G channel, and a binary image corresponding to the B channel.

In this case, the recognizer accepts the plurality of binary images as input. The edge image generation means 233 generates an edge image for each channel based on the gradation values of that channel of the learning multi-value image. Doing so improves the recognition accuracy of the recognizer.

The output of the converter may also be a multi-value image with a smaller gradation range than the multi-value image input to the converter. This improves the recognition accuracy of the recognizer.

In the above description, the image recognition system 1 has one imaging device 3 and one recognition device 4, but the configuration is not limited to this. The image recognition system 1 may have a plurality of imaging devices 3 or a plurality of recognition devices 4. In this case, the learning device 2 outputs the converter to each of the plurality of imaging devices 3, or outputs the recognizer to each of the plurality of recognition devices 4.

The functions of the learning device 2 or the display device 5 may also be realized by the imaging device 3 or the recognition device 4.

In the above description, a recognizer that recognizes the region of an object, or the region and type of an object, and the converter corresponding to it were given as examples. However, the recognizer and its corresponding converter may instead recognize attributes of a person such as age or gender, or a state such as the degree of crowding or the posture of people or vehicles; the invention can be applied to image recognition of various targets. In those cases, training is performed with learning recognition results set according to the target.

It should be understood that those skilled in the art can make various changes, substitutions, and modifications without departing from the spirit and scope of the present invention. For example, the processing of each unit described above may be executed in a different order as appropriate within the scope of the present invention. The embodiments and modifications described above may also be combined as appropriate within the scope of the present invention.

1 Image recognition system
2 Learning device
231 Learning model acquisition means
232 Learning data acquisition means
233 Edge image generation means
234 Learning means
235 Output means
3 Imaging device
341 Imaging means
342 Converting means
343 Binary image output means
4 Recognition device
431 Binary image acquisition means
432 Recognition means
433 Recognition result output means

Claims

An image recognition system comprising:
imaging means for imaging a space in which a predetermined target may be captured, to generate a first image composed of pixels having gradation values within a first gradation range;
converting means for converting the first image into a second image by means of a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
recognition means for generating a recognition result for the second image by means of a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs the recognition result,
wherein the converter and the recognizer are a trained neural network that has been trained such that the recognition result output when a first learning image, composed of pixels having gradation values within the first gradation range, is input to a neural network coupled so that the output of the converter becomes the input of the recognizer approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

The image recognition system according to claim 1,
wherein the converter and the recognizer are a trained neural network that has been trained such that the image composed of pixels having gradation values within the second gradation range, which the converter outputs when the first learning image is input to the coupled neural network, approaches an edge image generated from the first learning image, and such that the recognition result output by the coupled neural network approaches the learning recognition result.

The image recognition system according to claim 1 or 2, further comprising:
output means for outputting the converted second image to a predetermined transmission network; and
acquisition means for acquiring the second image from the transmission network,
wherein the recognition means generates a recognition result for the acquired second image.

An imaging device comprising:
imaging means for imaging a space in which a predetermined target may be captured, to generate a first image composed of pixels having gradation values within a first gradation range;
converting means for converting the first image into a second image by means of a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
output means for outputting the second image,
wherein the converter is a trained neural network that has been trained such that the recognition result output when a first learning image, composed of pixels having gradation values within the first gradation range, is input to a neural network coupled so that the output of the converter becomes the input of a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs a recognition result, approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

A recognition device comprising:
acquisition means for acquiring a second image obtained by converting a first image generated by imaging, the conversion being performed by a converter that, when the first image composed of pixels having gradation values within a first gradation range is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
recognition means for generating a recognition result for the second image by means of a recognizer that, when the second image is input, performs processing for recognizing a predetermined target on the second image and outputs the recognition result,
wherein the recognizer is a trained neural network that has been trained such that the recognition result output when a first learning image, composed of pixels having gradation values within the first gradation range, is input to a neural network coupled so that the output of the converter becomes the input of the recognizer approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.

An image recognition method comprising:
imaging a space in which a predetermined target may be captured, to generate a first image composed of pixels having gradation values within a first gradation range;
converting the first image into a second image by means of a converter that, when the first image is input, outputs the second image composed of pixels having gradation values within a second gradation range smaller than the first gradation range; and
generating a recognition result for the second image by means of a recognizer that, when the second image is input, performs processing for recognizing the target on the second image and outputs the recognition result,
wherein the converter and the recognizer are a trained neural network that has been trained such that the recognition result output when a first learning image, composed of pixels having gradation values within the first gradation range, is input to a neural network coupled so that the output of the converter becomes the input of the recognizer approaches a learning recognition result set in advance as the recognition result to be output for the first learning image.