JP7371466B2 - Image processing device

Info

Publication number: JP7371466B2
Application number: JP2019218770A
Authority: JP
Inventors: 篤志西田
Original assignee: Kyocera Document Solutions Inc
Current assignee: Kyocera Document Solutions Inc
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2023-10-31
Anticipated expiration: 2039-12-03
Also published as: JP2021089512A

Description

本発明は、画像処理装置に関する。 The present invention relates to an image processing device.

特許文献１には、オブジェクトを含む入力画像を取得し、入力画像から、背景画像を用いて変化する領域の画像である変化領域画像を抽出し、入力画像と変化領域画像とを結合して畳込み型ニューラルネットワークを利用することにより、Ｒ，Ｇ，Ｂ，Ｏの特徴画像を抽出し、当該特徴画像からオブジェクトの位置を検出する技術が記載されている。 Patent Document 1 describes a technique that acquires an input image including an object, uses a background image to extract from the input image a changed-region image (an image of the region that changes), combines the input image with the changed-region image and applies a convolutional neural network to extract R, G, B, and O feature images, and detects the position of the object from those feature images.

特開２０１７－１９１５０１公報 Japanese Patent Application Publication No. 2017-191501

しかしながら、特許文献１に記載された情報処理装置では、特徴画像から抽出されたオブジェクトの位置の正確さが低いという問題がある。例えば、ある画像の種別を判定させた場合、当該画像のどの特徴部分に注目してこの判定を行ったかを正確に示すことは困難である。また、判定に用いた上記特徴部分が単数ではなく複数である場合には、どの特徴部分に注目してこの判定を行ったかを正確に示すことは更に困難になる。 However, the information processing device described in Patent Document 1 has a problem in that the accuracy of the position of the object extracted from the feature image is low. For example, when determining the type of a certain image, it is difficult to accurately indicate which characteristic part of the image was focused on in making this determination. Furthermore, if the number of characteristic parts used in the determination is not singular but plural, it becomes even more difficult to accurately indicate which characteristic part was focused on in making the determination.

本発明は上記課題に鑑みてなされたものであり、画像に含まれるオブジェクトに基づいて当該画像の種別を判定するときに、当該オブジェクトが単数及び複数のいずれの場合であっても、当該オブジェクトを的確に抽出して、当該画像の種別を判定する精度を高く保つことを目的とする。 The present invention has been made in view of the above problem. Its object is, when the type of an image is determined based on an object included in the image, to extract the object accurately and keep the accuracy of the type determination high, regardless of whether the image contains a single object or multiple objects.

本発明の一局面に係る画像処理装置は、生成部と、畳込みニューラルネットワーク部と、グラッドカム部と、記憶部と、比較部と、出力部と、補正部とを含む処理部を備え、前記生成部は、処理対象とされる画像データから濃度マップを生成し、前記畳込みニューラルネットワーク部は、前記濃度マップにフィルターをかけて特徴マップを生成し、当該特徴マップから分類データを生成する処理を行い、前記グラッドカム部は、活性化関数を用いて前記特徴マップからアクティベーションマップを生成する処理を行い、当該アクティベーションマップをクラスタリングにより複数のグループに分割し、当該複数のグループについてそれぞれ補正関数を算出し、前記記憶部は、教師データを記憶しており、前記比較部は、前記分類データと前記教師データとを用いて、前記分類データについての第１損失関数を算出する処理を行い、前記出力部は、前記比較部により算出された前記第１損失関数に前記補正関数をそれぞれ加算して、前記複数のグループ毎に第２損失関数を算出し、前記補正部は、前記出力部が算出した前記各第２損失関数を合計した補正値を用いて、前記畳込みニューラルネットワーク部で用いる前記フィルターの重み付けを補正し、前記補正部により作成された新規の前記フィルターを用いた前記畳込みニューラルネットワーク部による前記処理から、前記グラッドカム部、前記比較部、前記出力部、及び前記補正部による処理を繰り返すことで前記フィルターを補正して更新する、ものである。 An image processing device according to one aspect of the present invention includes a processing unit that includes a generation unit, a convolutional neural network unit, a Grad-CAM unit, a storage unit, a comparison unit, an output unit, and a correction unit. The generation unit generates a density map from image data to be processed. The convolutional neural network unit applies a filter to the density map to generate a feature map, and generates classification data from the feature map. The Grad-CAM unit generates an activation map from the feature map using an activation function, divides the activation map into a plurality of groups by clustering, and calculates a correction function for each of the groups. The storage unit stores teacher data. The comparison unit calculates a first loss function for the classification data using the classification data and the teacher data. The output unit adds each correction function to the first loss function calculated by the comparison unit to calculate a second loss function for each of the groups. The correction unit corrects the weights of the filter used in the convolutional neural network unit using a correction value that is the sum of the second loss functions calculated by the output unit. The filter is corrected and updated by repeating the processing by the convolutional neural network unit using the new filter created by the correction unit, followed by the processing by the Grad-CAM unit, the comparison unit, the output unit, and the correction unit.

本発明によれば、画像に含まれるオブジェクトに基づいて当該画像の種別を判定するときに、当該オブジェクトが単数及び複数のいずれの場合であっても、当該オブジェクトを的確に抽出して、当該画像の種別を判定する精度を高く保つことができる。 According to the present invention, when the type of an image is determined based on an object included in the image, the object can be extracted accurately and the accuracy of the type determination can be kept high, regardless of whether the image contains a single object or multiple objects.

本発明に係る画像処理装置の一実施形態に係る画像形成装置の内部構成を示す図である。 FIG. 1 is a diagram showing the internal configuration of an image forming apparatus according to an embodiment of an image processing apparatus according to the present invention.

画像形成装置の電気的構成を示すブロック図である。 FIG. 2 is a block diagram showing the electrical configuration of the image forming apparatus.

（Ａ）は原稿を示す図、（Ｂ）は原稿を読み取って得た画像データから作成した濃度マップを示す図、（Ｃ）は濃度マップに対して用いるフィルターを示す図である。 FIG. 3(A) is a diagram showing a document, FIG. 3(B) is a diagram showing a density map created from image data obtained by reading the document, and FIG. 3(C) is a diagram showing a filter applied to the density map.

（Ａ）～（Ｃ）は畳込みニューラルネットワーク部による処理を説明する図である。 FIGS. 4(A) to 4(C) are diagrams illustrating processing by the convolutional neural network unit.

（Ａ）～（Ｃ）は畳込みニューラルネットワーク部による処理を説明する図である。 FIGS. 5(A) to 5(C) are diagrams illustrating processing by the convolutional neural network unit.

（Ａ）～（Ｃ）は畳込みニューラルネットワーク部による処理を説明する図である。 FIGS. 6(A) to 6(C) are diagrams illustrating processing by the convolutional neural network unit.

活性化関数ＲｅＬＵを説明する図である。 FIG. 7 is a diagram illustrating the activation function ReLU.

アクティベーションマップの一例を示す図である。 FIG. 8 is a diagram showing an example of an activation map.

（Ａ）は教師データ、分類データ、及び差分データの一例を示す図、（Ｂ）は畳込みニューラルネットワーク部が用いるフィルターが補正されていく変遷を示す図である。 FIG. 9(A) is a diagram showing an example of teacher data, classification data, and difference data, and FIG. 9(B) is a diagram showing how the filter used by the convolutional neural network unit changes as it is corrected.

（Ａ）～（Ｃ）は、教師データ、分類データ、及び差分データの一例を示す図である。 FIGS. 10(A) to 10(C) are diagrams showing examples of teacher data, classification data, and difference data.

（Ａ）～（Ｃ）は、フィルターが補正される度にアクティベーションマップが変化していく変遷を示す図である。 FIGS. 11(A) to 11(C) are diagrams showing how the activation map changes each time the filter is corrected.

以下、本発明の一実施形態に係る画像処理装置及び画像処理方法について、図面を参照しながら説明する。なお、以下の説明において、同一又は近似する各部については同一の符号を付し、繰り返しの説明は省略する。 An image processing apparatus and an image processing method according to an embodiment of the present invention will be described below with reference to the drawings. In the following description, identical or similar parts are denoted by the same reference numerals, and repeated description is omitted.

図１を参照して、本発明の一実施形態に係る画像処理装置について説明する。図１は、本発明に係る画像処理装置の一実施形態に係る画像形成装置の内部構成を示す図である。 An image processing apparatus according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a diagram showing the internal configuration of an image forming apparatus according to an embodiment of an image processing apparatus according to the present invention.

画像形成装置１００は、複写機、プリンター、及びファクシミリの機能を兼ね備えた複合機である。 The image forming apparatus 100 is a multifunction device that has the functions of a copying machine, a printer, and a facsimile.

図１に示すように、画像形成装置１００は、原稿搬送部２と、読取部４と、給送部６と、搬送部８と、画像形成部１０と、排出部１２と、処理部２０とを備える。更に、画像形成装置１００は、操作部５及びネットワークインターフェイス部９１を備える（図２）。 As shown in FIG. 1, the image forming apparatus 100 includes a document transport section 2, a reading section 4, a feeding section 6, a transport section 8, an image forming section 10, a discharge section 12, and a processing section 20. The image forming apparatus 100 further includes an operation section 5 and a network interface section 91 (FIG. 2).

原稿搬送部２は、トレイ８０に配置された原稿Ｇを搬送する。原稿搬送部２は、ピックアップローラー８２と、複数の搬送ローラー８４とを含んでよい。また、原稿搬送部２の一例は、ＡＤＦ（ＡｕｔｏＤｏｃｕｍｅｎｔＦｅｅｄｅｒ）である。原稿Ｇは、紙、またはプロジェクターに用いるプラスチックシートである。 The document transport unit 2 transports the document G placed on the tray 80. The document transport section 2 may include a pickup roller 82 and a plurality of transport rollers 84. Further, an example of the document transport unit 2 is an ADF (Auto Document Feeder). The original G is paper or a plastic sheet used in a projector.

読取部４は、原稿Ｇの画像を読み取る。読取部４は、画像を読み取って画像データを生成する。読取部４は、光源８６と、複数の反射ミラー８８と、レンズ９０と、撮像部９２とを含む。読取部４の一例は、スキャナーである。 The reading unit 4 reads the image of the document G. The reading unit 4 reads an image and generates image data. The reading section 4 includes a light source 86, a plurality of reflective mirrors 88, a lens 90, and an imaging section 92. An example of the reading unit 4 is a scanner.

光源８６は、複数の発光素子を有する。発光素子は、例えば、発光ダイオード（ＬｉｇｈｔＥｍｉｔｔｉｎｇＤｉｏｄｅ：ＬＥＤ）である。光源８６から出射された光は、原稿搬送部２を搬送される原稿Ｇによって反射した後、複数の反射ミラー８８で反射されて、レンズ９０を通り、撮像部９２に到達する。 The light source 86 has a plurality of light emitting elements. Each light emitting element is, for example, a light emitting diode (LED). Light emitted from the light source 86 is reflected by the document G being transported by the document transport section 2, is then reflected by the plurality of reflecting mirrors 88, passes through the lens 90, and reaches the imaging section 92.

撮像部９２は、レンズ９０から光を受け取る複数の受光素子を有している。撮像部９２は、例えば、電荷結合素子（ＣｈａｒｇｅＣｏｕｐｌｅｄＤｅｖｉｃｅ：ＣＣＤ）である。撮像部９２は、撮像部９２に到達した光からアナログ電気信号を生成する。その後、Ａ／Ｄ変換部（図示せず）において、当該アナログ信号がデジタル信号に変換され、このデジタル信号により画像データが構成される。読取部４は、当該画像データを処理部２０に出力する。 The imaging section 92 includes a plurality of light receiving elements that receive light from the lens 90. The imaging section 92 is, for example, a charge coupled device (CCD). The imaging section 92 generates an analog electrical signal from the light that reaches it. An A/D converter (not shown) then converts the analog signal into a digital signal, and this digital signal constitutes the image data. The reading section 4 outputs the image data to the processing section 20.

給送部６は、複数のシートＳを収容し、搬送部８へシートＳを給送する。シートＳは、例えば、紙製または合成樹脂製である。搬送部８は、複数の搬送ローラー対を含み、画像形成部１０にシートＳを搬送する。 The feeding section 6 accommodates a plurality of sheets S, and feeds the sheets S to the conveying section 8. The sheet S is made of paper or synthetic resin, for example. The conveyance section 8 includes a plurality of conveyance roller pairs, and conveys the sheet S to the image forming section 10.

画像形成部１０は、電子写真方式によってシートＳにトナー像を形成する。具体的には、画像形成部１０は、感光体ドラムと、帯電装置と、露光装置と、現像装置と、補給装置と、転写装置と、クリーニング装置と、除電装置とを含む。 The image forming section 10 forms a toner image on the sheet S using an electrophotographic method. Specifically, the image forming section 10 includes a photosensitive drum, a charging device, an exposure device, a developing device, a replenishing device, a transfer device, a cleaning device, and a static eliminator.

トナー像は、例えば、原稿Ｇの画像を示す。排出部１２は、画像形成装置１００の外部にシートＳを排出する。 The toner image shows, for example, an image of a document G. The discharge unit 12 discharges the sheet S to the outside of the image forming apparatus 100.

次に、図２を参照して、本実施形態に係る画像形成装置１００の電気的構成を説明する。図２は、本実施形態に係る画像形成装置１００の電気的構成を示すブロック図である。図３（Ａ）は原稿を示す図、（Ｂ）は原稿を読み取って得た画像データから作成した濃度マップを示す図、（Ｃ）は濃度マップに対して用いるフィルターを示す図である。 Next, the electrical configuration of the image forming apparatus 100 according to this embodiment will be described with reference to FIG. 2. FIG. 2 is a block diagram showing the electrical configuration of the image forming apparatus 100 according to this embodiment. FIG. 3(A) is a diagram showing a document, FIG. 3(B) is a diagram showing a density map created from image data obtained by reading the document, and FIG. 3(C) is a diagram showing a filter applied to the density map.

処理部２０は、プロセッサー、ＲＡＭ（Random Access Memory）、ＲＯＭ（Read Only Memory）、及び専用のハードウェア回路を含んで構成される。プロセッサーは、例えばＣＰＵ（Central Processing Unit）、ＡＳＩＣ（Application Specific Integrated Circuit）、又はＭＰＵ（Micro Processing Unit）等である。処理部２０は、生成部２２と、畳込みニューラルネットワーク部２４と、グラッドカム部２６と、記憶部２８と、比較部３０と、出力部３２と、補正部３４と、分類部３６とを備えている。 The processing unit 20 includes a processor, a RAM (Random Access Memory), a ROM (Read Only Memory), and a dedicated hardware circuit. The processor is, for example, a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or an MPU (Micro Processing Unit). The processing unit 20 includes a generation unit 22, a convolutional neural network unit 24, a Grad-CAM unit 26, a storage unit 28, a comparison unit 30, an output unit 32, a correction unit 34, and a classification unit 36.

処理部２０は、画像形成装置１００が備えるＨＤＤ１１１又は上記ＲＯＭに記憶されている制御プログラムに従った上記プロセッサーによる動作により、生成部２２と、畳込みニューラルネットワーク部２４と、グラッドカム部２６と、記憶部２８と、比較部３０と、出力部３２と、補正部３４と、分類部３６として機能するものである。但し、当該生成部２２～分類部３６は、上記プロセッサーによる制御プログラムに従った動作によらず、それぞれハードウェア回路により構成することも可能である。 The processing unit 20 functions as the generation unit 22, the convolutional neural network unit 24, the Grad-CAM unit 26, the storage unit 28, the comparison unit 30, the output unit 32, the correction unit 34, and the classification unit 36 through operation of the processor in accordance with a control program stored in the HDD 111 of the image forming apparatus 100 or in the ROM. However, the generation unit 22 through the classification unit 36 may each be implemented as a hardware circuit rather than through operation of the processor in accordance with the control program.

生成部２２、畳込みニューラルネットワーク部２４、グラッドカム部２６、記憶部２８、比較部３０、出力部３２、及び補正部３４は、例えば読取部４による原稿読取で得られた画像データに対して、以下に示す処理を行う。例えば、原稿Ｇは、図３（Ａ）に示すように、表題に法人名が記載された文書である。原稿Ｇの表題には、例えば、「ＡＢＣＤ株式会社」と記載された画像部分Ａ１を有し、更に、文末に「ＡＢＣＤ株式会社」がもう一度記載された画像部分Ａ２を有している。 The generation unit 22, the convolutional neural network unit 24, the Grad-CAM unit 26, the storage unit 28, the comparison unit 30, the output unit 32, and the correction unit 34 perform the processing described below on, for example, image data obtained when the reading unit 4 reads a document. For example, as shown in FIG. 3(A), the document G is a document whose title contains a corporate name. The title of the document G has, for example, an image portion A1 reading "ABCD Corporation", and the end of the text has an image portion A2 in which "ABCD Corporation" appears once more.

生成部２２は、上記画像データから濃度マップ５０（図３（Ｂ））を生成する。 The generation unit 22 generates a density map 50 (FIG. 3(B)) from the image data.

畳込みニューラルネットワーク部２４は、濃度マップ５０にフィルター５２（図３（Ｃ））をかけて、第１特徴マップ５４（図４（Ｃ））を生成する処理を行う。更に、畳込みニューラルネットワーク部２４は、第１特徴マップ５４から第２特徴マップ（図５（Ｃ））を生成する処理を行う。更に、畳込みニューラルネットワーク部２４は、第２特徴マップから分類データを生成する処理を行う。 The convolutional neural network unit 24 performs a process of applying a filter 52 (FIG. 3(C)) to the density map 50 to generate a first feature map 54 (FIG. 4(C)). Further, the convolutional neural network unit 24 performs processing to generate a second feature map (FIG. 5(C)) from the first feature map 54. Further, the convolutional neural network unit 24 performs a process of generating classification data from the second feature map.

グラッドカム部２６は、Ｇｒａｄ－ｃａｍ処理を行うことにより、第２特徴マップ５６からアクティベーションマップ６６を生成する。 The Grad-CAM unit 26 generates an activation map 66 from the second feature map 56 by performing Grad-CAM processing.

更にグラッドカム部２６は、生成したアクティベーションマップ６６を複数のグループに分割する。グラッドカム部２６は、アクティベーションマップ６６を複数のグループに分割する処理を、クラスタリング（例えば、k-means）により行う。本実施形態では、グラッドカム部２６は、アクティベーションマップ６６に対するk-meansによるクラスタリングを行って、アクティベーションマップ６６を複数に分割する例を説明する。例えば、グラッドカム部２６は、アクティベーションマップ６６から、クラスタ毎に、濃度が濃い中心点を検出し、当該中心点から予め定められた一定の距離内に存在する各点の集合を１つのグループとする分割処理を行う。 Furthermore, the Grad-CAM unit 26 divides the generated activation map 66 into a plurality of groups. The Grad-CAM unit 26 performs this division by clustering (for example, k-means). This embodiment describes an example in which the Grad-CAM unit 26 clusters the activation map 66 by k-means to divide it into a plurality of parts. For example, the Grad-CAM unit 26 detects, for each cluster, a center point of high density in the activation map 66, and treats the set of points lying within a predetermined distance of that center point as one group.

なお、グラッドカム部２６は、上記画像データに対して、上記グループ数が推定されるx-meansでのクラスタリングを行って、推定されたグループ数にアクティベーションマップ６６を分割してもよい。 Note that the Grad-CAM unit 26 may instead cluster the image data by x-means, which estimates the number of groups, and divide the activation map 66 into the estimated number of groups.
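The grouping step described above can be illustrated with a minimal sketch. It assumes a plain k-means over the coordinates of cells whose activation exceeds a threshold. The 6×6 map values, the threshold of 0.5, the initial centers, and the names `kmeans` and `hot_cells` are hypothetical and are not taken from the specification, which does not state how cluster centers are initialized.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points; `centers` holds the initial guesses."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        groups = [[] for _ in centers]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical activation map 66 with two strongly activated regions,
# standing in for image portions A1 and A2.
activation_map_66 = [
    [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.7],
]
threshold = 0.5  # assumed cut-off for a "high density" cell
hot_cells = [
    (r, c)
    for r, row in enumerate(activation_map_66)
    for c, value in enumerate(row)
    if value >= threshold
]
centers, groups = kmeans(hot_cells, centers=[(0.0, 0.0), (5.0, 5.0)])
print(centers)  # [(0.5, 0.5), (4.5, 4.5)]
```

With x-means, the number of initial centers would be estimated rather than supplied by hand as it is here.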

記憶部２８は、例えばＨＤＤ又はメモリーであり、教師データ６０を記憶している。本実施形態では、処理部２０により、原稿Ｇ（図３（Ａ））がＡＢＣＤ株式会社宛に作成された文書であるのかを判定する場合を例にする。このため、例えば、「ＡＢＣＤ株式会社」の文字を示す見本画像に対して、生成部２２及び畳込みニューラルネットワーク部２４による処理を行って分類データを生成し、この生成された分類データを教師データとする。 The storage unit 28 is, for example, an HDD or a memory, and stores teacher data 60. This embodiment takes as an example the case where the processing unit 20 determines whether the document G (FIG. 3(A)) is a document addressed to ABCD Corporation. To this end, for example, the generation unit 22 and the convolutional neural network unit 24 process a sample image showing the characters "ABCD Corporation" to generate classification data, and this classification data is used as the teacher data.

比較部３０は、分類データ６２と、教師データ６０とを比較し、差分データ６４を算出する。また、比較部３０は、分類データ６２と教師データ６０を用いて第１損失関数（Ｌｏｓｓｆｕｎｃｔｉｏｎ）を算出する。 The comparison unit 30 compares the classification data 62 and the teacher data 60 and calculates difference data 64. Furthermore, the comparison unit 30 calculates a first loss function using the classification data 62 and the teacher data 60.

出力部３２は、上記第１損失関数から更に第２損失関数（Ｌｏｓｓｆｕｎｃｔｉｏｎ）を算出する。 The output unit 32 further calculates a second loss function from the first loss function.

補正部３４は、出力部３２から出力されてくる上記第２損失関数を用いてフィルター５２の重み付けを補正する。 The correction unit 34 corrects the weighting of the filter 52 using the second loss function outputted from the output unit 32.
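The chain from comparison unit 30 to correction unit 34 can be sketched numerically. As set out in the summary of the invention, the second loss function for each group is the first loss function plus that group's correction function, and the correction value is the sum of the second loss functions. The specification does not name the loss formula, so mean squared error is used here purely as an assumption, and all numeric values (including the per-group correction values) are hypothetical.

```python
def mean_squared_error(classification, teacher):
    """Assumed form of the first loss function; the source does not specify one."""
    return sum((c - t) ** 2 for c, t in zip(classification, teacher)) / len(teacher)

# Hypothetical one-dimensional data.
teacher_data_60 = [1.0, 0.0, 1.0, 0.0]
classification_data_62 = [0.5, 0.0, 1.0, 0.5]

# Comparison unit 30: difference data and first loss function.
difference_data_64 = [c - t for c, t in zip(classification_data_62, teacher_data_60)]
first_loss = mean_squared_error(classification_data_62, teacher_data_60)

# Output unit 32: one second loss function per group of the activation map.
# The correction function values are hypothetical placeholders.
group_corrections = [0.25, 0.5]
second_losses = [first_loss + correction for correction in group_corrections]

# Correction unit 34 uses the sum of the second loss functions as the correction value.
correction_value = sum(second_losses)
print(first_loss, second_losses, correction_value)  # 0.125 [0.375, 0.625] 1.0
```

How the correction value is turned into new filter weights (for example, by gradient descent) is not shown, because this part of the description does not specify it.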

分類部３６は、画像判定処理を行う。分類部３６は、上記のように補正部３４により補正されて更新された最新のフィルター５２を用いて畳込みニューラルネットワーク部２４により生成された分類データが示す各値の配列によって、画像の種別を判定する。 The classification unit 36 performs image determination processing. The classification unit 36 determines the type of an image from the arrangement of values in the classification data that the convolutional neural network unit 24 generates using the latest filter 52, corrected and updated by the correction unit 34 as described above.

操作部５は、ユーザーから各種操作指示の入力を受け付ける。 The operation unit 5 receives input of various operation instructions from the user.

ネットワークインターフェイス部９１は、図略のＬＡＮチップなどの通信モジュールを備える通信インターフェイスである。ネットワークインターフェイス部９１は、ローカルエリア内、又はインターネット上の外部装置と種々のデータの送受信を行う。 The network interface unit 91 is a communication interface including a communication module such as a LAN chip (not shown). The network interface unit 91 sends and receives various data to and from external devices within the local area or on the Internet.

次に、図２に加えて更に図３乃至図１１を参照して、本実施形態に係る画像形成装置１００の具体例を説明する。 Next, a specific example of the image forming apparatus 100 according to the present embodiment will be described with reference to FIGS. 3 to 11 in addition to FIG. 2.

まず、図３（Ａ）～（Ｃ）を参照して、畳込みニューラルネットワーク部２４の処理の概略を説明する。畳込みニューラルネットワーク部２４は、畳込み処理とプーリング処理とを繰り返し、分類データ（図９（Ａ））を得る処理を行う。 First, an outline of the processing of the convolutional neural network unit 24 will be explained with reference to FIGS. 3(A) to 3(C). The convolutional neural network unit 24 repeats convolution processing and pooling processing to obtain classification data (FIG. 9(A)).

畳込みニューラルネットワーク部２４は、生成部２２により生成された濃度マップ５０（図３（Ｂ））に、予め記憶しているフィルター５２（図３（Ｃ））を掛け、図４（Ｃ）に示す第１特徴マップ５４を生成し、更に第１特徴マップ５４から第２特徴マップ５６を生成する処理を行う。 The convolutional neural network unit 24 applies a pre-stored filter 52 (FIG. 3(C)) to the density map 50 (FIG. 3(B)) generated by the generation unit 22 to generate the first feature map 54 shown in FIG. 4(C), and further generates the second feature map 56 from the first feature map 54.

続けて、畳込みニューラルネットワーク部２４は、プーリング処理により、第２特徴マップ５６から、予め定められたマトリクス（この実施形態では２×２を例にして説明する）ごとに代表値５６Ｃ、代表値５６Ｄを抽出する（図６（Ａ）（Ｂ））。畳込みニューラルネットワーク部２４は、(i)当該プーリング処理を繰り返して第２特徴マップ５６を小さくし、更に、このように小さくした第２特徴マップ５６を、一次的に並ぶ一列の予め定められた画素数のデータに展開する、(ii) 第２特徴マップ５６を全結合層により、一次的に並ぶ一列の予め定められた画素数のデータに展開する、等の処理により、図９（Ａ）に示す分類データ６２を得る。分類データ６２については、図９（Ａ）を参照して後述する。 Subsequently, through pooling processing, the convolutional neural network unit 24 extracts representative values 56C and 56D from the second feature map 56 for each predetermined matrix (2×2 is used as the example in this embodiment) (FIGS. 6(A) and 6(B)). The convolutional neural network unit 24 obtains the classification data 62 shown in FIG. 9(A) by processing such as (i) repeating the pooling processing to shrink the second feature map 56 and then expanding the reduced second feature map 56 into a one-dimensional row of data with a predetermined number of pixels, or (ii) expanding the second feature map 56 into a one-dimensional row of data with a predetermined number of pixels by means of a fully connected layer. The classification data 62 will be described later with reference to FIG. 9(A).

グラッドカム部２６は、Ｇｒａｄ－ｃａｍ処理を行うことにより、第２特徴マップ５６からアクティベーションマップ６６を生成する。更にグラッドカム部２６は、生成したアクティベーションマップ６６を複数のグループに分割する。この分割処理により、グラッドカム部２６は、例えば、図３（Ａ）に示した原稿Ｇの画像における画像部分Ａ１を含むグループと、画像部分Ａ２を含むグループとに、アクティベーションマップ６６を分割する。 The Grad-CAM unit 26 generates an activation map 66 from the second feature map 56 by performing Grad-CAM processing. The Grad-CAM unit 26 further divides the generated activation map 66 into a plurality of groups. Through this division, the Grad-CAM unit 26 divides the activation map 66 into, for example, a group including the image portion A1 and a group including the image portion A2 in the image of the document G shown in FIG. 3(A).

ここで、畳込みニューラルネットワーク部２４による上記畳込み処理を更に詳細に説明する。読取部４（図２）は原稿Ｇを読み取り、図３（Ａ）に例を示す画像データを生成する。 Here, the above convolution processing by the convolutional neural network unit 24 will be explained in more detail. The reading unit 4 (FIG. 2) reads the original G and generates image data, an example of which is shown in FIG. 3(A).

そして、生成部２２は、原稿Ｇ全体を示す画像について濃度マップ５０（図３（Ｂ））を生成する。濃度マップ５０は、画素ｘ１１、画素ｘ１２、画素ｘ１３、・・・、画素ｘ２１、画素ｘ２２、画素ｘ２３、・・・、画素ｘ３１、画素ｘ３２、画素ｘ３３、・・・、画素ｘｉｊ、・・・がマトリクス状に配列されたものである。それぞれの画素ｘｉｊには、濃度情報が付与されている。 Then, the generation unit 22 generates a density map 50 (FIG. 3(B)) for the image showing the entire document G. The density map 50 includes pixels x11, pixel x12, pixel x13, ..., pixel x21, pixel x22, pixel x23, ..., pixel x31, pixel x32, pixel x33, ..., pixel xij, ... are arranged in a matrix. Density information is given to each pixel xij.

なお、一例として、画素ｘｉｊの画素数は２桁であるが、２桁に限られず、１桁でもよく、３桁以上であってもよい。また、以降、画素番号を特定する必要がない場合は、画素ｘｉｊと記載する。本実施形態においては、説明を簡略にするため、濃度マップ５０は、画素ｘ１１～画素ｘ９９の９×９のマトリクスとする。ｉおよびｊは、正の整数である。フィルター５２は、画素ｗ１１、画素ｗ１２、画素ｗ１３、画素ｗ２１、画素ｗ２２、画素ｗ２３、画素ｗ３１、画素ｗ３２、および画素ｗ３３の画素ｗｉｊが例えば３×３のマトリクス状に配列されている。 Note that, as an example, the pixel number (the subscript of pixel xij) has two digits, but it is not limited to two digits and may have one digit or three or more digits. Hereinafter, when a specific pixel number need not be identified, the pixel is written as pixel xij. In this embodiment, to simplify the description, the density map 50 is a 9×9 matrix of pixels x11 to x99. i and j are positive integers. In the filter 52, the pixels wij, namely pixels w11, w12, w13, w21, w22, w23, w31, w32, and w33, are arranged in, for example, a 3×3 matrix.

まず、畳込みニューラルネットワーク部２４は、濃度マップ５０の畳込みを行う。具体的には、図４（Ａ）に示すように、畳込みニューラルネットワーク部２４は、フィルター５２を濃度マップ５０に掛け合わせる。まず、畳込みニューラルネットワーク部２４は、フィルター５２の画素ｗ１１～画素ｗ３３を、濃度マップ５０の画素ｘ１１～画素ｘ３３に掛け合わせる。 First, the convolutional neural network unit 24 convolves the density map 50. Specifically, as shown in FIG. 4A, the convolutional neural network unit 24 multiplies the density map 50 by a filter 52. First, the convolutional neural network unit 24 multiplies pixels w11 to w33 of the filter 52 to pixels x11 to x33 of the density map 50.

続いて、畳込みニューラルネットワーク部２４は、フィルター５２を濃度マップ５０で列番号が増える方向に１列ずらし、フィルター５２の画素ｗ１１～画素ｗ３３を、濃度マップ５０の画素ｘ１２～画素ｘ３４に掛け合わせる。以降、同様に、畳込みニューラルネットワーク部２４は、フィルター５２を濃度マップ５０の画素ｘ１１～画素ｘ３３のマトリクスから画素ｘ７７～画素ｘ９９のマトリクスまで順次掛け合わせる。 Next, the convolutional neural network unit 24 shifts the filter 52 by one column in the direction of increasing column numbers in the density map 50, and multiplies the pixels w11 to w33 of the filter 52 to the pixels x12 to x34 of the density map 50. . Thereafter, similarly, the convolutional neural network unit 24 sequentially multiplies the filter 52 from the matrix of pixels x11 to x33 of the density map 50 to the matrix of pixels x77 to x99.

すなわち、畳込みニューラルネットワーク部２４は、上記のようにフィルター５２を濃度マップ５０に掛け合わせることで、図４（Ｂ）に示す第１特徴データｙｉｊ＝ｘｉｊ×ｗｉｊの行列式を得る。ｉおよびｊは、正の整数である。具体的には、畳込みニューラルネットワーク部２４は、フィルター５２の画素ｗ１１～画素ｗ３３を濃度マップ５０の画素ｘ１１～画素ｘ３３に掛け合わせて、第１特徴データｙ１１＝ｘ１１ｗ１１＋ｘ１２ｗ１２＋ｘ１３ｗ１３＋ｘ２１ｗ２１＋ｘ２２ｗ２２＋ｘ２３ｗ２３＋ｘ３１ｗ３１＋ｘ３２ｗ３２＋ｘ３３ｗ３３を得る。次に、フィルター５２の画素ｗ１１～画素ｗ３３を、濃度マップ５０の画素ｘ１２～画素ｘ３４に掛け合わせて、第１特徴データｙ１２＝ｘ１２ｗ１１＋ｘ１３ｗ１２＋ｘ１４ｗ１３＋ｘ２２ｗ２１＋ｘ２３ｗ２２＋ｘ２４ｗ２３＋ｘ３２ｗ３１＋ｘ３３ｗ３２＋ｘ３４ｗ３３を得る。以降、同様に、畳込みニューラルネットワーク部２４は、フィルター５２を濃度マップ５０の画素ｘ７７～画素ｘ９９のマトリクスまで順次掛け合わせる。 That is, by applying the filter 52 to the density map 50 as described above, the convolutional neural network unit 24 obtains the first feature data shown in FIG. 4(B), written in shorthand as yij = xij × wij; each value is the sum of the element-wise products of the filter and the 3×3 window it covers. i and j are positive integers. Specifically, the convolutional neural network unit 24 applies pixels w11 to w33 of the filter 52 to pixels x11 to x33 of the density map 50 to obtain first feature data y11 = x11w11 + x12w12 + x13w13 + x21w21 + x22w22 + x23w23 + x31w31 + x32w32 + x33w33. Next, it applies pixels w11 to w33 of the filter 52 to pixels x12 to x34 of the density map 50 to obtain first feature data y12 = x12w11 + x13w12 + x14w13 + x22w21 + x23w22 + x24w23 + x32w31 + x33w32 + x34w33. Thereafter, in the same manner, the convolutional neural network unit 24 applies the filter 52 in turn up to the matrix of pixels x77 to x99 of the density map 50.

そして、畳込みニューラルネットワーク部２４は、濃度マップ５０にフィルター５２を掛け合わせて得た第１特徴データｙ１１、第１特徴データｙ１２、・・・、第１特徴データｙｉｊ、・・・、第１特徴データｙ７７によって構成されるマトリクス状に配置した第１特徴マップ５４（図４（Ｃ））を得る。これにより、畳込みニューラルネットワーク部２４は、９×９の濃度マップ５０を、７×７の第１特徴マップ５４にする畳込み処理を行ったことになる。 The convolutional neural network unit 24 then generates first feature data y11, first feature data y12, . . . , first feature data yij, . A first feature map 54 (FIG. 4(C)) arranged in a matrix formed by the feature data y77 is obtained. As a result, the convolutional neural network unit 24 has performed convolution processing from the 9×9 density map 50 to the 7×7 first feature map 54.
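The sliding-window procedure described above can be sketched as follows. This is a minimal illustration that assumes a stride of one cell and no padding, which is what the 9×9 to 7×7 size change implies. The function name `convolve_valid` and the all `-1` density map are hypothetical and serve only to check the output size.

```python
def convolve_valid(density_map, kernel):
    """Slide `kernel` over `density_map` one cell at a time (no padding) and
    return the matrix of sums of element-wise products."""
    n_rows, n_cols = len(density_map), len(density_map[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            acc = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    acc += density_map[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 9x9 density map 50 (every cell -1) and a 3x3 filter 52.
density_map_50 = [[-1] * 9 for _ in range(9)]
filter_52 = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, 1, -1],
]
first_feature_map_54 = convolve_valid(density_map_50, filter_52)
print(len(first_feature_map_54), len(first_feature_map_54[0]))  # 7 7
```

A 3×3 filter fits into a 9×9 map at 9 - 3 + 1 = 7 positions per axis, which is why the first feature map 54 is 7×7.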

更に図５（Ａ）～（Ｃ）を参照して、畳込みニューラルネットワーク部２４による畳込み処理の具体例を説明する。図５（Ａ）に示す濃度マップ５０は、図３（Ｂ）に示す濃度マップ５０の具体例である。図５（Ａ）に示すフィルター５２は、図３（Ｃ）に示すフィルター５２の具体例である。 Next, a specific example of the convolution processing by the convolutional neural network unit 24 will be described with reference to FIGS. 5(A) to 5(C). The density map 50 shown in FIG. 5(A) is a specific example of the density map 50 shown in FIG. 3(B). The filter 52 shown in FIG. 5(A) is a specific example of the filter 52 shown in FIG. 3(C).

ここでは、濃度マップ５０を分かりやすく説明するために、濃度マップ５０を構成する各画素の値を２値「１」及び「－１」のいずれかで示した例を用いて説明する。また、フィルター５２も、２値「１」及び「－１」のいずれかで示した例を用いて説明する。 Here, in order to explain the density map 50 in an easy-to-understand manner, an example will be described in which the value of each pixel constituting the density map 50 is shown as either a binary value of "1" or "-1". Further, the filter 52 will also be explained using an example shown as either a binary value "1" or "-1".

図５（Ａ）に示す濃度マップ５０は、画素ｘ１１＝－１、画素ｘ１２＝－１、画素ｘ１３＝－１、画素ｘ２１＝－１、画素ｘ２２＝１、画素ｘ２３＝１、画素ｘ３１＝－１、画素ｘ３２＝－１、および画素ｘ３３＝－１…とした例を示す。 The density map 50 shown in FIG. 5A has pixel x11=-1, pixel x12=-1, pixel x13=-1, pixel x21=-1, pixel x22=1, pixel x23=1, pixel x31=- 1, pixel x32=-1, pixel x33=-1, and so on.

図５（Ａ）に示すフィルター５２は、図３（Ｃ）に示すフィルター５２の画素ｗｉｊを２値の「１」か「－１」のいずれかで表す。一例として、画素ｗ１１＝１、画素ｗ１２＝－１、画素ｗ１３＝－１、画素ｗ２１＝－１、画素ｗ２２＝１、画素ｗ２３＝－１、画素ｗ３１＝－１、画素ｗ３２＝１、および画素ｗ３３＝－１である。 The filter 52 shown in FIG. 5(A) represents the pixel wij of the filter 52 shown in FIG. 3(C) as either binary "1" or "-1". As an example, pixel w11=1, pixel w12=-1, pixel w13=-1, pixel w21=-1, pixel w22=1, pixel w23=-1, pixel w31=-1, pixel w32=1, and pixel w33=-1.

畳込みニューラルネットワーク部２４は、濃度マップ５０の画素ｘｉｊとフィルター５２の画素ｗｉｊとを掛け合わせ、図４（Ｂ）において説明したように、第１特徴データｙｉｊ＝ｘｉｊ×ｗｉｊを得る。具体的には、第１特徴データｙ１１＝－１×１＋－１×－１＋－１×－１＋－１×－１＋１×１＋１×－１＋－１×－１＋－１×１＋－１×－１＝３、第１特徴データｙ１２＝－１×１＋－１×－１＋－１×－１＋１×－１＋１×１＋１×－１＋－１×－１＋－１×１＋－１×－１＝１、・・・である。 The convolutional neural network unit 24 multiplies the pixels xij of the density map 50 by the pixels wij of the filter 52 and, as described with reference to FIG. 4(B), obtains the first feature data yij = xij × wij. Specifically, first feature data y11 = (-1×1) + (-1×-1) + (-1×-1) + (-1×-1) + (1×1) + (1×-1) + (-1×-1) + (-1×1) + (-1×-1) = 3, first feature data y12 = (-1×1) + (-1×-1) + (-1×-1) + (1×-1) + (1×1) + (1×-1) + (-1×-1) + (-1×1) + (-1×-1) = 1, and so on.

畳込みニューラルネットワーク部２４は、第１特徴データｙ１１＝３、第１特徴データｙ１２＝１、・・・、および第１特徴データｙ７７＝－３をマトリクス状に並べ、図５（Ｂ）に示すように、第１特徴マップ５４を生成する。 The convolutional neural network unit 24 arranges first feature data y11 = 3, first feature data y12 = 1, ..., and first feature data y77 = -3 in a matrix to generate the first feature map 54, as shown in FIG. 5(B).

さらに、畳込みニューラルネットワーク部２４は、図５（Ａ）に示す例ではフィルター５２が３×３の９個のマトリクスからなることから、第１特徴マップ５４を構成する各第１特徴データｙｉｊを例えば１／９の値に変換して、第２特徴マップ５６を生成する。つまり、第２特徴マップ５６が第２特徴データｚｉｊにより構成されるとすると、第２特徴データｚｉｊ＝第１特徴データｙｉｊ×１／９である。 Furthermore, because the filter 52 in the example shown in FIG. 5(A) is a 3×3 matrix of nine elements, the convolutional neural network unit 24 converts each first feature data yij constituting the first feature map 54 to, for example, 1/9 of its value to generate the second feature map 56. That is, if the second feature map 56 is composed of second feature data zij, then zij = yij × 1/9.

具体的には、第２特徴データｚ１１＝３×１／９＝０．３３、第２特徴データｚ１２＝１×１／９＝０．１１、・・・、第２特徴データｚ７７＝－３×１／９＝－０．３３である。畳込みニューラルネットワーク部２４は、これらの第２特徴データｚｉｊをマトリクス状に並べ、図５（Ｃ）に示すように、第２特徴マップ５６を生成する。以上が畳込みニューラルネットワーク部による畳込み処理である。 Specifically, second feature data z11=3×1/9=0.33, second feature data z12=1×1/9=0.11, ..., second feature data z77=-3× 1/9=-0.33. The convolutional neural network unit 24 arranges these second feature data zij in a matrix, and generates a second feature map 56 as shown in FIG. 5(C). The above is the convolution process by the convolution neural network unit.
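The arithmetic of this worked example can be checked with a short sketch. The patch below holds only the density map cells the text actually states: x11 to x33 are listed for FIG. 5(A), and x14, x24, and x34 are read off the expansion of y12. The helper name `window_sum` is hypothetical.

```python
# Top-left 3x4 patch of the density map 50 in FIG. 5(A).
patch = [
    [-1, -1, -1, -1],
    [-1,  1,  1,  1],
    [-1, -1, -1, -1],
]
# Filter 52 in FIG. 5(A): w11..w33.
filter_52 = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1,  1, -1],
]

def window_sum(rows, col_offset, kernel):
    """Sum of element-wise products of `kernel` and the 3x3 window of `rows`
    starting at column `col_offset`."""
    return sum(
        rows[a][col_offset + b] * kernel[a][b]
        for a in range(3)
        for b in range(3)
    )

y11 = window_sum(patch, 0, filter_52)  # window x11..x33
y12 = window_sum(patch, 1, filter_52)  # window x12..x34
z11 = y11 / 9  # second feature data: scale by 1/9 (nine filter elements)
z12 = y12 / 9
print(y11, y12, round(z11, 2), round(z12, 2))  # 3 1 0.33 0.11
```

The values 0.33 and 0.11 in the text are therefore 3/9 and 1/9 rounded to two decimal places.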

次に、畳込みニューラルネットワーク部２４によるプーリング処理を説明する。図６（Ａ）に示すように、畳込みニューラルネットワーク部２４は、第２特徴マップ５６を複数の特徴マトリクス５６Ａ、特徴マトリクス５６Ｂ、・・・、に分割し、それぞれの特徴マトリクス５６Ａ、特徴マトリクス５６Ｂ、・・・、から代表値５６Ｃ、代表値５６Ｄ、・・・、を抽出する。 Next, the pooling processing by the convolutional neural network unit 24 will be described. As shown in FIG. 6(A), the convolutional neural network unit 24 divides the second feature map 56 into a plurality of feature matrices 56A, 56B, ..., and extracts representative values 56C, 56D, ... from the respective feature matrices 56A, 56B, ....

具体的には、畳込みニューラルネットワーク部２４は、第２特徴マップ５６を特徴マトリクス５６Ａ、５６Ｂ、・・・、の２×２のマトリクスに分割する。ただし、２列に満たない場合は、１×２のマトリクスに分割する。そして、畳込みニューラルネットワーク部２４は、特徴マトリクス５６Ａから、特徴マトリクス５６Ａの代表値５６Ｃを抽出する。ここでは、畳込みニューラルネットワーク部２４は、代表値として、特徴マトリクス５６Ａの最大値を抽出するものとする。例えば、代表値５６Ｃは、特徴マトリクス５６Ａの最大値０．３３である。但し、畳込みニューラルネットワーク部２４は、代表値を、最大値ではなく、例えば、平均値又は中央値として抽出するようにしてもよい（本願明細書の全編に亘って同様）。 Specifically, the convolutional neural network unit 24 divides the second feature map 56 into 2×2 feature matrices 56A, 56B, . . . . However, if there are less than two columns, it is divided into a 1×2 matrix. Then, the convolutional neural network unit 24 extracts the representative value 56C of the feature matrix 56A from the feature matrix 56A. Here, it is assumed that the convolutional neural network unit 24 extracts the maximum value of the feature matrix 56A as a representative value. For example, the representative value 56C is the maximum value 0.33 of the feature matrix 56A. However, the convolutional neural network unit 24 may extract the representative value as an average value or a median value, for example, instead of the maximum value (the same applies throughout the specification of the present application).

Next, the convolutional neural network unit 24 extracts the representative value 56D from among the values constituting the feature matrix 56B. Again, the maximum value is used as the representative value; for example, the representative value 56D is 0.11, the maximum value of the feature matrix 56B.

The convolutional neural network unit 24 generates a matrix in which the representative values extracted in this way (representative value 56C, representative value 56D, ...) are arranged. This completes the pooling processing.
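The pooling steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `max_pool_2x2` and the 3×3 sample map are hypothetical, and only the values 0.33, 0.11, and -0.33 echo figures quoted in the description. Blocks at the right and bottom edges are simply allowed to be smaller than 2×2.

```python
def max_pool_2x2(feature_map):
    """Split a 2-D list into 2x2 blocks (smaller at the edges) and keep each block's maximum."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for top in range(0, rows, 2):
        pooled_row = []
        for left in range(0, cols, 2):
            # Collect the values of one block; min() clips the block at the map border.
            block = [
                feature_map[i][j]
                for i in range(top, min(top + 2, rows))
                for j in range(left, min(left + 2, cols))
            ]
            pooled_row.append(max(block))  # representative value = maximum
        pooled.append(pooled_row)
    return pooled


# Hypothetical 3x3 feature map used only for illustration.
example_map = [
    [0.33, 0.11, -0.11],
    [0.11, -0.33, 0.11],
    [-0.11, 0.11, -0.33],
]
pooled = max_pool_2x2(example_map)
```

Replacing `max(block)` with an average or a median would give the alternative representative values that the description allows.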

Furthermore, (i) the convolutional neural network unit 24 repeats the convolution processing and the pooling processing until a matrix consisting of a predetermined number of representative values is obtained (FIG. 6(B) shows an example, matrix 58, in which the predetermined number is 3×3). The convolutional neural network unit 24 then expands the representative values of that matrix one-dimensionally to generate the classification data 62, an example of which is shown in FIG. 9(A). (ii) Alternatively, the convolutional neural network unit 24 repeats the convolution processing and the pooling processing until a matrix consisting of a predetermined second number of representative values, larger than the number in case (i), is obtained; once that matrix is obtained, it performs fully connected layer processing to create a matrix consisting of the above predetermined number of representative values (FIG. 6(B)), and expands the representative values of this matrix one-dimensionally to generate the classification data 62, an example of which is shown in FIG. 9(A). This completes the processing by the convolutional neural network unit 24.
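The one-dimensional expansion in case (i) amounts to reading the final matrix row by row. The sketch below assumes row-major order, which the description does not specify; the name `flatten` and the sample 3×3 matrix are hypothetical.

```python
def flatten(matrix):
    """Expand a 2-D list of representative values into a 1-D list, row by row."""
    return [value for row in matrix for value in row]


# Hypothetical 3x3 matrix of representative values.
matrix_58 = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
classification_data = flatten(matrix_58)
```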

Next, Grad-CAM processing will be described with reference to FIGS. 7, 8, and 11 in addition to FIG. 2. The Grad-CAM unit 26 performs Grad-CAM (Gradient-weighted Class Activation Mapping) processing. As the Grad-CAM processing, the Grad-CAM unit 26 applies the activation function ReLU (Rectified Linear Unit) to the second feature map 56 generated by the convolutional neural network unit 24, thereby generating the activation map 66 (FIG. 8).

As shown in FIG. 7, the activation function ReLU is a function that sets every value less than 0 to 0. In other words, ReLU treats only the portions at or above a certain threshold as meaningful information, that is, it emphasizes those portions as feature portions. In FIG. 7, the horizontal axis represents yij and the vertical axis represents f(yij): when yij < 0, f(yij) = 0, and when yij ≥ 0, f(yij) = yij. The Grad-CAM unit 26 applies ReLU to each value of the second feature map 56 and generates a matrix-shaped activation map 66 consisting of the resulting values.
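The element-wise application of ReLU described above can be sketched as follows. The function names and the 2×2 sample map are hypothetical and serve only to illustrate f(y) = 0 for y < 0 and f(y) = y for y ≥ 0.

```python
def relu(y):
    """Return y when y >= 0, otherwise 0."""
    return y if y >= 0 else 0.0


def make_activation_map(feature_map):
    """Apply ReLU to every value of a 2-D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


# Hypothetical 2x2 excerpt of a second feature map.
second_feature_map = [[0.33, 0.11], [-0.11, -0.33]]
activation = make_activation_map(second_feature_map)
```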

FIG. 8 shows an example of the activation map 66. In the activation map 66, the second feature data zij constituting the second feature map 56 are further emphasized by the Grad-CAM unit 26, so that the feature portions in the document data obtained by reading the document G are emphasized.

The activation map 66 highlights not only the image of the portion of the document G where the company name is written but also other portions of the document G. The correction unit 34 therefore corrects the pixels wij, which are the weights of the filter 52 used by the convolutional neural network unit 24, so that the activation map 66 accurately highlights only the image portion that should serve as the basis for image type determination, for example the portion of the document G where the company name is written (that is, so that it is clear which image portion in the whole image is the company name portion).

Here, the Grad-CAM unit 26 performs the division processing described above to divide the activation map 66 into a plurality of groups. The Grad-CAM unit 26 then calculates a correction function f(GradCAM) for each of the groups. The correction function f(GradCAM) is expressed by formula (1).
$$f(\mathrm{GradCAM}) = \sum_{r \neq c} \mathrm{Act}(r)\,(r - c)^{2} \quad \cdots (1)$$

In the correction function f(GradCAM) of formula (1), a specific pixel among the pixels constituting the group is denoted as pixel c, and every other pixel constituting the group is denoted as a pixel r. Each term of the correction function f is the product of the activation function Act(r) and the distance function (r - c)^2, which represents the square of the distance between pixel c and pixel r. That is, the correction function f is obtained by calculating this product for every combination of pixel c and a pixel r in the group, and then summing all of the products. The Grad-CAM unit 26 calculates the correction function f for each of the divided groups.

Here, as an example, the pixel with the maximum density in the group is taken as pixel c, and an arbitrary pixel is taken as pixel r. The activation function Act(r) represents, for example, the magnitude of the response of the activation map 66 (the group) at pixel r.

For example, the density of pixel r is used as the activation function Act(r); that is, the value of Act(r) changes according to the density of pixel r. The value of the distance function (r - c)^2 becomes larger as the distance between pixel c and pixel r increases, and smaller as that distance decreases.
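The correction function for a single group can be sketched as follows. This is an illustration under stated assumptions: pixels are identified by hypothetical (row, column) coordinates, the distance is taken to be Euclidean (the description only says "distance"), Act(r) is the pixel density, and pixel c is the maximum-density pixel of the group. The name `correction_function` and the sample densities are not from the source.

```python
def correction_function(group):
    """Sum Act(r) * (squared distance between r and c) over all pixels r != c.

    group: dict mapping (row, col) -> density of each pixel in one group.
    Assumption: squared Euclidean distance on pixel coordinates.
    """
    c = max(group, key=group.get)  # pixel c: the maximum-density pixel of the group
    total = 0.0
    for r, act_r in group.items():
        if r == c:
            continue  # the sum runs over the pixels other than c
        dist_sq = (r[0] - c[0]) ** 2 + (r[1] - c[1]) ** 2
        total += act_r * dist_sq
    return total


# Hypothetical group of three pixels with their densities.
group = {(0, 0): 1.0, (0, 1): 0.5, (2, 0): 0.25}
f_value = correction_function(group)  # 0.5 * 1 + 0.25 * 4
```

Because each term grows with both the density of r and its distance from c, the value is large when strong activations lie far from the group's peak, which is the situation the filter correction is meant to suppress.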

Next, the storage unit 28 and the comparison unit 30 will be described with reference to FIG. 9.

As described above, the storage unit 28 stores teacher data 60 (FIG. 9(A)). As shown in FIG. 9(A), the teacher data 60 is a sequence in which the same number of values as in the classification data are arranged one-dimensionally. The storage unit 28 stores teacher data 60 corresponding to each image that serves as a target image for determining the type of a document.

The comparison unit 30 compares the classification data 62 acquired from the convolutional neural network unit 24 with the teacher data 60 stored in the storage unit 28, and calculates difference data 64. The comparison unit 30 also calculates a first loss function (Loss function) from the classification data 62 and the teacher data 60 using formula (2) below.
...(2)

The output unit 32 then acquires the value of the correction function f(GradCAM) (formula (1)) from the Grad-CAM unit 26, and calculates the sum of the first loss function (formula (2)) calculated by the comparison unit 30 and the value of the correction function f(GradCAM) as a second loss function (Loss function). That is, the second loss function is expressed by formula (3). The output unit 32 calculates the second loss function for each group, using the correction function f of that group.
$$\text{(second loss function)} = \text{(first loss function)} + f(\mathrm{GradCAM}) \quad \cdots (3)$$

The correction unit 34 takes the sum of the second loss functions of the groups output by the output unit 32 as a correction value, and uses this correction value to correct the weights of the filter 52 used by the convolutional neural network unit 24, thereby creating the corrected filter. Because the second loss function is based on an activation map 66 that accurately emphasizes the image of the company name portion of the document G, which should be the basis for the determination, correcting the weights of the filter 52 with the second loss function corrects the values of the pixels wij of the filter 52 (FIG. 9(B)) so that the classification data 62 (FIG. 9(A)) subsequently generated by the convolutional neural network unit 24 approaches the teacher data 60. The filter obtained by correcting the filter 52 in this way is referred to as filter 52B (FIG. 9(B)).

Specifically, as shown in FIG. 10(A), upon acquiring classification data 62A, the comparison unit 30 takes the difference between teacher data 60A and the classification data 62A to calculate difference data 64A. The comparison unit 30 also calculates the first loss function (formula (2)) from the teacher data 60A and the classification data 62A. For each group, the output unit 32 calculates the sum of the first loss function and the value of that group's correction function f, and takes this sum as the second loss function (formula (3)). The correction unit 34 sums the second loss functions of the groups to calculate the correction value. Based on this correction value (for example, by multiplying the filter 52 by the correction value), the correction unit 34 corrects the filter 52 of the convolutional neural network unit 24 to, for example, the filter 52B, as shown in FIG. 9(B).
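The arithmetic of the per-group second loss and the resulting correction value can be sketched as follows. The form of the first loss function (formula (2)) is not reproduced in this text, so the sketch assumes a sum of squared differences purely for illustration; the function names and sample values are likewise hypothetical.

```python
def first_loss(classification, teacher):
    # Assumption: sum of squared differences. Formula (2) itself is not
    # available in the text, so this choice is only a stand-in.
    return sum((t - c) ** 2 for c, t in zip(classification, teacher))


def correction_value(classification, teacher, group_f_values):
    """Second loss per group = first loss + that group's f; return their sum."""
    loss1 = first_loss(classification, teacher)
    second_losses = [loss1 + f for f in group_f_values]
    return sum(second_losses)


# Hypothetical classification data, teacher data, and per-group f values.
classification = [0.5, 0.5, 0.0]
teacher = [1.0, 0.0, 0.0]
value = correction_value(classification, teacher, [1.5, 0.5])
```

With these inputs the first loss is 0.5, the two second losses are 2.0 and 1.0, and the correction value is their sum. How the correction value is then applied to the filter weights (the description mentions multiplication as one example) is left out of this sketch.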

Once the filter 52 has been corrected to the filter 52B in this way, the convolutional neural network unit 24 generates new classification data using the corrected filter 52B (an example is shown as classification data 62B in FIG. 10(B)), and the Grad-CAM unit 26 calculates a new correction function f for each group. The comparison unit 30 then compares teacher data 60B with the classification data 62B and outputs difference data 64B. At this point, the classification data 62B is closer to the teacher data 60B than the classification data 62A was. The comparison unit 30 further calculates the first loss function from the classification data 62B and the teacher data 60B, and the output unit 32 calculates a new second loss function for each group using the new correction function f that the Grad-CAM unit 26 calculated for that group. The correction unit 34 corrects the filter 52B using the correction value obtained by summing the new second loss functions of the groups, thereby creating a filter 52C.

The convolutional neural network unit 24 generates still newer classification data using each of the corrected filters 52C (an example is shown as classification data 62C in FIG. 10(C)). The Grad-CAM unit 26 again calculates a new correction function f for each group. The comparison unit 30 then compares teacher data 60C with the classification data 62C and outputs difference data 64C. The classification data 62C is closer still to the teacher data 60C than the classification data 62B was (FIG. 10(C) shows an example in which the teacher data 60C and the classification data 62C match). In the example shown in FIG. 10(C), the difference data 64C consists only of zeros.

FIGS. 11(A) to 11(C) show how the activation map 66 of each group generated by the Grad-CAM unit 26 is corrected as the following are repeated for each group: the filter correction processing by the convolutional neural network unit 24 through the correction unit 34 described above, and the activation map generation processing by the convolutional neural network unit 24 and the Grad-CAM unit 26 using the newly corrected filter.

For example, the first activation map 66A shown in FIG. 11(A), which is for one of the plurality of groups, shows high density not only in the image of the company name portion of the document G but across the whole of the document G. When the filter is corrected as described above, the convolutional neural network unit 24 applies the new filter 52B to the density map 50 to generate a new first feature map 54 and a new second feature map 56, and the Grad-CAM unit 26 generates a second activation map 66B from the new second feature map 56. As shown in FIG. 11(B), in the second activation map 66B the density of pixels far from the title portion is greatly reduced compared with the first activation map 66A.

Further, the filter 52B is corrected to the filter 52C by the filter correction processing of the convolutional neural network unit 24 through the correction unit 34, and a third activation map 66C is generated by the activation map generation processing of the convolutional neural network unit 24 and the Grad-CAM unit 26 using this newly corrected filter. In the third activation map 66C, as shown for example in FIG. 11(C), the density of pixels far from the title portion is reduced even further compared with the second activation map 66B. FIG. 11(C) shows an example in which, compared with the second activation map 66B, the density of pixels far from the pixels representing the company name portion of the document G is close to 0, and the density of the image of the company name portion of the document G is further emphasized.

In this way, the filter correction processing by the convolutional neural network unit 24 through the correction unit 34 is repeated, and the repetition ends when, for example, the teacher data and the classification data match and the difference data becomes 0 (or when the classification data has approached the teacher data to within a predetermined range). The latest filter as corrected at that point is then set as the filter that the convolutional neural network unit 24 uses for the convolution processing. The above is the learning processing for extracting the image that should serve as the basis for image type determination.
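The stopping condition described above can be sketched as a simple check on the difference data. The names `difference_data` and `should_stop` and the `tolerance` parameter are hypothetical; a tolerance of 0 corresponds to an exact match, and a positive tolerance stands in for the "predetermined range", whose actual definition the description does not give.

```python
def difference_data(teacher, classification):
    """Element-wise difference between teacher data and classification data."""
    return [t - c for t, c in zip(teacher, classification)]


def should_stop(teacher, classification, tolerance=0.0):
    """True when every difference is within the tolerance (0 means exact match)."""
    return all(abs(d) <= tolerance for d in difference_data(teacher, classification))


# Hypothetical data: an exact match ends the repetition.
stop_exact = should_stop([1, 0, 0], [1, 0, 0])
# A near match ends it only when a tolerance is allowed.
stop_near = should_stop([1.0, 0, 0], [0.9, 0, 0], tolerance=0.2)
```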

In the image type determination processing, the classification unit 36 determines the type of an image from the arrangement of the values indicated by the classification data that the convolutional neural network unit 24 generates from the document image (the image obtained when the reading unit 4 reads the document) using the latest filter 52, corrected and updated as described above. For example, (A) when only the first value of the classification data is greater than 0 and all other values are 0, the classification unit 36 takes the image portion that should be the basis for image type determination to be "ABCD company" and determines that the image is a "document addressed to ABCD company"; (B) when only the second value of the classification data is greater than 0 and all other values are 0, it takes that image portion to be "EFGH company" and determines that the image is a "document addressed to EFGH company"; and so on.
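Rules (A) and (B) above can be sketched as a lookup on the position of the single positive value. The two labels come from the description; the function name `classify`, the `LABELS` list, and the choice to return `None` when no rule matches are assumptions made for illustration.

```python
# Labels taken from examples (A) and (B); the list itself is hypothetical.
LABELS = [
    "Document addressed to ABCD company",
    "Document addressed to EFGH company",
]


def classify(classification_data, labels=LABELS):
    """Return the label whose position holds the only value greater than 0."""
    positive = [i for i, value in enumerate(classification_data) if value > 0]
    if len(positive) == 1 and positive[0] < len(labels):
        return labels[positive[0]]
    return None  # Assumption: no decision when the pattern matches no rule.


type_a = classify([0.9, 0, 0])
type_b = classify([0, 0.7, 0])
```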

Accordingly, once the learning processing described in this embodiment has been completed, the classification data subsequently generated by the convolutional neural network unit 24 matches or approximates the teacher data. As a result, when the type of an image is determined, each object can be accurately extracted and the accuracy of the image type determination can be kept high, even when there are multiple objects, rather than a single one, serving as the image portions on which the determination should be based.

In this embodiment, if the learning processing up to and including the correction of the filter 52 is also performed each time the image type determination processing is performed, the filter of the convolutional neural network unit 24 is corrected more suitably as the number of documents G read and the number of images processed by the image forming apparatus 100 increase, so the accuracy of image type determination can be improved.

The configuration and processing of the above embodiment described with reference to FIGS. 1 to 11 are merely one embodiment of the present invention, and are not intended to limit the present invention to that configuration and processing.

The above embodiment describes an example in which an embodiment of the image processing according to the present invention is applied to an image forming apparatus (multifunction peripheral), but this is only an example. The image processing according to the present invention may also be applied to other electronic devices, such as medical equipment, personal computers, mobile phones, smartphones, tablets, hub devices, and server devices.

100 Image forming apparatus
4 Reading unit
20 Processing unit
22 Generation unit
24 Convolutional neural network unit
26 Grad-CAM unit
28 Storage unit
30 Comparison unit
32 Output unit
34 Correction unit
36 Classification unit

Claims

An image processing device comprising a processing unit that includes a generation unit, a convolutional neural network unit, a Grad-CAM unit, a storage unit, a comparison unit, an output unit, and a correction unit, wherein
the generation unit generates a density map from image data to be processed,
the convolutional neural network unit generates a feature map by applying a filter to the density map, and generates classification data from the feature map,
the Grad-CAM unit generates an activation map from the feature map using an activation function, divides the activation map into a plurality of groups by clustering, and calculates a correction function for each of the plurality of groups,
the storage unit stores teacher data,
the comparison unit calculates a first loss function for the classification data using the classification data and the teacher data,
the output unit adds the correction function to the first loss function calculated by the comparison unit to calculate a second loss function for each of the plurality of groups,
the correction unit corrects the weights of the filter used in the convolutional neural network unit using a correction value that is the sum of the second loss functions calculated by the output unit, and
the processing by the convolutional neural network unit using the new filter created by the correction unit, followed by the processing by the Grad-CAM unit, the comparison unit, the output unit, and the correction unit, is repeated to correct and update the filter.

The image processing device according to claim 1, wherein the Grad-CAM unit performs the division into the plurality of groups by detecting, from the activation map, a high-density center point for each cluster, and forming the set of points within a predetermined distance from that center point into one group.

The image processing device according to claim 1 or 2, wherein the processing unit further includes a classification unit, and the classification unit determines the type of an image based on the arrangement of the values indicated by classification data that the convolutional neural network unit generates using the updated filter.