JP7377661B2

JP7377661B2 - Image semantic region segmentation device, region detection sensitivity improvement method, and program

Info

Publication number: JP7377661B2
Application number: JP2019178591A
Authority: JP
Inventors: 美恵大串; 貴広馬場; 陽太 ▲高▼岡; 英雄寺田
Original assignee: Open Stream Inc
Current assignee: Open Stream Inc
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2023-11-10
Anticipated expiration: 2039-09-30
Also published as: JP2021056721A

Description

The present invention relates to an image semantic region segmentation device, a method for improving region detection sensitivity, and a program.

In semantic region segmentation (also called semantic segmentation) of digital images, methods that extract image features automatically through machine learning with a DCNN (Deep Convolutional Neural Network) have come into wide use in recent years.

One such task is to perform semantic region segmentation on a document image, that is, a document digitized with a scanner, camera, or the like, and to determine the class (= element type) of each pixel. Here, classes are defined according to the categories the user wants to distinguish semantically in the target image, such as a character class, a figure class, a photograph class, and a background class.

The discrete two-dimensional convolution operation in such a DCNN extracts features of the spatial shape of image patterns. For example, convolution operations in a DCNN can serve to extract the basic shape features that make up an image, such as horizontal edges, vertical edges, diagonal edges, rectangle corners, and circles. Ideally, convolution can provide feature extraction capability for every feature pattern and for every direction in which a feature occurs (that is, in all directions). In a real convolution obtained by machine learning (hereinafter also called a learned convolution), however, the feature extraction characteristics are determined by weight parameters acquired through a finite number of training iterations, so they are never completely ideal and usually contain some bias. An example of such a bias is an output for a left 45-degree edge that is slightly larger than the output for a right 45-degree edge.
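The directional bias described above can be illustrated with a small sketch. The kernel weights and the two 3x3 patches below are hypothetical values chosen only for illustration; they are not taken from the embodiment. A perfectly symmetric kernel would use the same weight in all four corners.

```python
# Hypothetical "learned" 3x3 kernel: the corner weights are slightly unequal,
# standing in for the bias left by a finite amount of training.
LEARNED_KERNEL = [
    [1.0, 0.0, 0.9],
    [0.0, 1.0, 0.0],
    [0.9, 0.0, 1.0],
]

# Two 3x3 patches, each containing a diagonal stroke in one direction.
DIAGONAL_A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
DIAGONAL_B = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]


def response(kernel, patch):
    """Sum of element-wise products: one output value of a discrete
    2-D convolution (in the cross-correlation form commonly used in CNNs)."""
    return sum(k * p
               for kernel_row, patch_row in zip(kernel, patch)
               for k, p in zip(kernel_row, patch_row))


response_a = response(LEARNED_KERNEL, DIAGONAL_A)
response_b = response(LEARNED_KERNEL, DIAGONAL_B)
print(response_a, response_b)  # the two diagonals give slightly different outputs
```

The two patches are mirror images of each other, yet the kernel responds more strongly to one of them, which is the kind of asymmetry the text attributes to learned convolutions.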

Consequently, semantic region segmentation using a learned convolution does not always produce output that is perfectly line-symmetric or rotationally symmetric with respect to the input image. For example, the segmentation result for an image pattern A and the result for the same pattern rotated by 90 degrees should ideally match, but in practice they often differ slightly.

Similarly, because convolution on a digital computer is discrete, the same input image pattern can produce different output depending on its position in the image. For example, the output labels for a pattern B may not match the output labels obtained when pattern B is shifted by one pixel.

Because of these practical characteristics of learned convolutions, rotation, vertical flipping, or positional displacement of an image can cause pixels that should be detected in semantic region segmentation to go undetected, which lowers detection sensitivity.

Patent Document 1 performs segmentation (region division) of concrete images in the construction field using a CNN (Convolutional Neural Network) and, to improve accuracy, uses an image feature detector obtained by a first machine learning together with a region divider obtained by a second machine learning.

Japanese Unexamined Patent Application Publication No. 2019-133433

The problem to be solved by the present invention is to prevent the loss of region detection sensitivity caused by discrete convolution in machine-learning-based semantic region segmentation of images. The invention is intended to be especially effective for images in which characters, figures, and the like are placed on a relatively uniform background (= non-natural images), such as document images.

Patent Document 1 is an example of applying a CNN to segmentation, but it uses two machine learners to improve accuracy. Using two machine learners inflates the computation time and memory capacity required for training. In addition, that document does not make good use of the characteristics of discrete convolutions obtained by machine learning, as described in the background art above.

The present invention takes the practical characteristics of discrete convolution described above into account and exploits them, aiming to improve the sensitivity (also called accuracy) of semantic region segmentation using only one machine learner.

The present invention has been made in view of these circumstances and provides an image semantic region segmentation device, a region detection sensitivity improvement method, and a program that can improve the sensitivity of semantic region segmentation using only one machine learner.

To solve the above problem, the present invention is a determination device comprising: an acquisition unit that acquires image information including a pixel value for each pixel of a target image containing characters and geometric figures; a generation unit that, based on the image information acquired by the acquisition unit, generates a modulated image in which the pixel value of each pixel of the target image is changed to a first pixel value if the pixel is an edge and to a second pixel value, different from the first pixel value, if the pixel is not an edge; an estimation unit (the part that uses semantic region segmentation) that estimates, using the same region divider regardless of the image being estimated, an element type that distinguishes whether a pixel in an image is a character element, indicating an element that constitutes a character, a geometric element, indicating an element that constitutes a geometric figure, or a background element, indicating an element that constitutes background that is neither a character nor a geometric figure; and a determination unit that determines the element type of each pixel of the target image based on the element types estimated by the estimation unit for each pixel of the target image and of the modulated image.
To solve the above problem, the present invention is also a determination device comprising: an acquisition unit that acquires image information including a pixel value for each pixel of a target image containing characters and geometric figures; a generation unit that, based on the image information acquired by the acquisition unit, generates a modulated image in which the pixel coordinates of each pixel of the target image are moved; an estimation unit (the part that uses semantic region segmentation) that estimates, using a trained model having a convolution layer, an element type that distinguishes whether a pixel in an image is a character element, indicating an element that constitutes a character, a geometric element, indicating an element that constitutes a geometric figure, or a background element, indicating an element that constitutes background that is neither a character nor a geometric figure; and a determination unit that determines the element type of each pixel of the target image based on the element types estimated by the estimation unit for each pixel of the target image and of the modulated image.

Further, in the determination device described above, the generation unit changes the pixel value of each pixel of the target image to a predetermined pixel value according to whether the pixel is an edge.

Further, in the determination device described above, the estimation unit estimates the element type of pixels in an image using a trained model, and the trained model is the result of machine-training a learning model on a dataset that associates image information of training images with the element types of the pixels in those training images.

Further, in the determination device described above, the determination unit determines that the element type of a predetermined pixel is the character element when at least one of the element type of the predetermined pixel in the target image and the element type of the corresponding pixel in the modulated image (the pixel corresponding to the predetermined pixel) is the character element; determines that it is the geometric element when at least one of the two element types is the geometric element; and determines that it is the background element when both element types are the background element.
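The decision rule in this paragraph can be sketched as a small function. The label strings and function names are hypothetical. The text does not state what happens when one estimate is a character element and the other a geometric element, so the sketch assumes the rules are checked in the order they are listed, with the character rule taking precedence.

```python
CHARACTER, GEOMETRIC, BACKGROUND = "character", "geometric", "background"


def decide_element_type(target_type, modulated_type):
    """Combine the estimates for a pixel of the target image and for the
    corresponding pixel of the modulated image into one element type."""
    estimates = (target_type, modulated_type)
    if CHARACTER in estimates:   # at least one estimate says "character"
        return CHARACTER
    if GEOMETRIC in estimates:   # at least one estimate says "geometric"
        return GEOMETRIC
    return BACKGROUND            # both estimates say "background"


def decide_label_map(target_labels, modulated_labels):
    """Apply the rule pixel by pixel to two label maps of equal size,
    assuming corresponding pixels share the same coordinates."""
    return [[decide_element_type(t, m) for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(target_labels, modulated_labels)]
```

Because a pixel is kept as background only when both estimates agree on background, a character or figure pixel missed in one image can still be recovered from the other, which is how the rule raises detection sensitivity.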

Further, the present invention is a determination method in which: an acquisition unit acquires image information including a pixel value for each pixel of a target image containing characters and geometric figures; a generation unit, based on the image information acquired by the acquisition unit, generates a modulated image in which the pixel value of each pixel of the target image is changed to a first pixel value if the pixel is an edge and to a second pixel value, different from the first pixel value, if the pixel is not an edge; an estimation unit estimates, using the same region divider regardless of the image being estimated, an element type that distinguishes whether a pixel in an image is a character element, indicating an element that constitutes a character, a geometric element, indicating an element that constitutes a geometric figure, or a background element, indicating an element that constitutes background that is neither a character nor a geometric figure; and a determination unit determines the element type of the pixels of the target image based on the element types estimated by the estimation unit for each pixel of the target image and of the modulated image.
Further, the present invention is a determination method in which: an acquisition unit acquires image information including a pixel value for each pixel of a target image containing characters and geometric figures; a generation unit, based on the image information acquired by the acquisition unit, generates a modulated image in which the pixel coordinates of each pixel of the target image are moved; an estimation unit estimates, using a trained model having a convolution layer, an element type that distinguishes whether a pixel in an image is a character element, indicating an element that constitutes a character, a geometric element, indicating an element that constitutes a geometric figure, or a background element, indicating an element that constitutes background that is neither a character nor a geometric figure; and a determination unit determines the element type of the pixels of the target image based on the element types estimated by the estimation unit for each pixel of the target image and of the modulated image.

Further, the present invention is a program for causing a computer to operate as the determination device described above, the program causing the computer to function as each unit of the determination device.

According to the present invention, characters and geometric figures in an image can be distinguished.

FIG. 1 is a block diagram showing an example of the configuration of the region dividing device 10 according to the embodiment.
FIG. 2 is a diagram illustrating the processing performed by the region dividing device 10 according to the embodiment.
FIG. 3 is a flowchart showing the flow of the processing performed by the region dividing device 10 according to the embodiment.

Embodiments of the invention are described below with reference to the drawings.

The region dividing device 10 divides an image into regions according to the semantic type of the content shown in the image (semantic region segmentation). The following description uses lines, characters, and background as the semantic types by which regions are divided. The semantic types of content are not limited to these, however. In addition to lines, characters, and background, the region dividing device 10 may divide an image by types such as images, figures, symbols, colors, and shapes, and the approach described below applies equally to those semantic types.

For each pixel of an image, the region dividing device 10 determines which content shown in the image the pixel is an element of and, based on the result, divides the image into regions by content. The region dividing device 10 is thus an example of a "determination device" that determines which content shown in an image a pixel is an element of. For example, the region dividing device 10 determines whether a pixel belongs to a character or to some other element. Here, the other element is, for example, a geometric figure, such as a line, a line segment, or a group of symbols arranged so as to satisfy certain conditions.

FIG. 1 is a block diagram showing an example of the configuration of the region dividing device 10 according to the embodiment. The region dividing device 10 includes, for example, an image information acquisition unit 11, a modulated image generation unit 12, an element type estimation unit 13, an element type determination unit 14, a region map generation unit 15, and a map information output unit 16. Here, the image information acquisition unit 11 is an example of the "acquisition unit", the modulated image generation unit 12 is an example of the "generation unit", the element type estimation unit 13 is an example of the "estimation unit", and the element type determination unit 14 is an example of the "determination unit".

The image information acquisition unit 11 acquires image information of a scanned image G11 (see FIG. 2). The scanned image G11 contains lines and characters. Lines may be combined, or partly bent (or curved), to form ruled lines, frame lines, and the like. The scanned image G11 is the image that the region dividing device 10 divides into regions; that is, it is an example of the "target image".

The scanned image G11 is created, for example, by using a scanner to read a printed image G10 (see FIG. 2), which is a printout of an image shown on a display or on the Web. Image information associates information about the image with each pixel, for example a grayscale value or an RGB value for each pixel. The image information acquisition unit 11 outputs the acquired image information to the modulated image generation unit 12 and the element type estimation unit 13.

The modulated image generation unit 12 generates an enhanced image G12 (see FIG. 2) based on the image information acquired from the image information acquisition unit 11. The enhanced image G12 is an image in which the pixel value (grayscale value or RGB value) of each pixel of the scanned image G11 has been changed according to a predetermined modulation condition, and is an example of the "modulated image".

The modulated image generation unit 12 generates as the enhanced image G12, for example, an image obtained by applying enhancement processing that emphasizes the edges of the scanned image G11. In this case, the modulated image generation unit 12 detects the edges in the scanned image G11 using any conventional method. For example, it detects edges by taking the difference between the scanned image G11 after median filtering and the scanned image G11 after smoothing with a Gaussian filter or the like. Alternatively, it may detect edges in the scanned image G11 by applying a Laplacian filter or a Sobel filter.
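As one illustration of the Sobel option mentioned above, the following minimal sketch detects edges in a grayscale image stored as a list of rows. The function name, the threshold value, and the choice to leave border pixels marked as non-edge are all assumptions made for this example; the text does not specify them.

```python
def sobel_edges(gray, threshold=128):
    """Return a boolean mask that is True where the Sobel gradient magnitude
    (approximated as |gx| + |gy|) reaches the threshold.
    Border pixels are left as non-edge (an assumption of this sketch)."""
    height, width = len(gray), len(gray[0])
    edges = [[False] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = ((gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1])
                  - (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]))
            # Vertical gradient: bottom row minus top row.
            gy = ((gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1])
                  - (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]))
            edges[y][x] = abs(gx) + abs(gy) >= threshold
    return edges


# A 5x5 image whose two left columns are white (255) and the rest black (0).
step_image = [[255, 255, 0, 0, 0] for _ in range(5)]
step_edges = sobel_edges(step_image)
```

In this example the interior pixels on either side of the white-to-black boundary are marked as edges, while pixels in the uniform black area are not.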

The modulated image generation unit 12 generates the enhanced image G12 by setting the pixels detected as edges to one specific pixel value (for example, the grayscale or RGB value representing black) and setting the remaining pixels, which were not detected as edges, to another specific pixel value (for example, the grayscale or RGB value representing white).
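A minimal sketch of this two-value mapping follows, assuming 8-bit grayscale output and a boolean edge mask produced by some earlier edge detection step. The function and constant names are hypothetical.

```python
BLACK, WHITE = 0, 255  # 8-bit grayscale values, assumed for this sketch


def make_enhanced_image(edge_mask, edge_value=BLACK, non_edge_value=WHITE):
    """Build an image in which every edge pixel takes one fixed value and
    every other pixel takes a different fixed value."""
    return [[edge_value if is_edge else non_edge_value for is_edge in row]
            for row in edge_mask]


enhanced = make_enhanced_image([[True, False], [False, True]])
```

The result contains only two pixel values, so the original shading is discarded and only the outline structure is passed on to the estimator.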

The enhanced image G12 is not limited to an edge-enhanced image as described above. It may be any image generated according to a predetermined modulation condition; for example, it may be an image in which the pixels are moved a predetermined distance (for example, a distance corresponding to a predetermined number of pixels) horizontally and/or vertically. The modulated image generation unit 12 outputs the image information of the generated enhanced image G12 to the element type estimation unit 13.
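The pixel-shift variant can be sketched as follows. The text does not say how vacated pixels are filled or what happens to pixels moved outside the image, so this sketch assumes a white fill value and simply discards pixels that leave the frame. The function name and parameters are hypothetical.

```python
def shift_image(image, dx, dy, fill=255):
    """Move every pixel dx columns to the right and dy rows down.
    Vacated positions receive `fill`; pixels shifted out of the frame
    are dropped (both are assumptions of this sketch)."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_x, new_y = x + dx, y + dy
            if 0 <= new_x < width and 0 <= new_y < height:
                shifted[new_y][new_x] = image[y][x]
    return shifted


shifted_right = shift_image([[1, 2], [3, 4]], dx=1, dy=0, fill=9)
```

Negative values of `dx` or `dy` move the image left or up.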

The element type estimation unit 13 estimates the element type of each pixel for both the scanned image G11 and the enhanced image G12. The element type indicates what kind of content in the image a pixel is an element of, and is one of character element, line segment element, and background element. A character element indicates that the pixel is part of a character in the image. A line segment element indicates that the pixel is part of a line segment in the image. A background element indicates that the pixel is part of the background of the image (neither a line segment nor a character). Here, the line segment element is an example of the "geometric element".

The element type estimation unit 13 estimates element types in an image by, for example, a machine learning technique. For example, it uses a trained model, that is, a model that has learned the relationship between image information and the element type of each pixel.

(Basics of DCNN)
The trained model is trained by, for example, supervised learning. It is generated by training a model such as a DCNN (Deep Convolutional Neural Network) on a training dataset. A DCNN is a deep neural network whose main components are convolution layers. In image recognition, a DCNN uses a two-dimensional convolution layer as its input layer, which lets it efficiently recognize image feature information that takes into account both the pixel of interest and its neighboring pixels. Two-dimensional convolutions are further stacked in multiple layers, which lets the network also recognize global image feature information that takes into account pixels farther from the pixel of interest.
(Training the DCNN)
The computation of a convolution layer can be expressed as a mathematical linear transformation (y=<W,x>+b). That is, it is a differentiable expression. A differentiable layer can be trained using the principle of supervised learning for neural networks known as error backpropagation.
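To make the expression y=<W,x>+b concrete, here is a minimal sketch of a single-channel two-dimensional convolution layer. Each output value is the inner product of the weights W with the image patch x under the kernel, plus the bias b. The function name is hypothetical, and the absence of padding and the stride of 1 are assumptions of this example.

```python
def conv2d(image, weights, bias):
    """Single-channel 2-D convolution (cross-correlation form),
    no padding, stride 1: out[y][x] = <W, patch at (y, x)> + b."""
    kernel_h, kernel_w = len(weights), len(weights[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = bias
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += weights[i][j] * image[y + i][x + j]
            row.append(acc)
        output.append(row)
    return output


feature_map = conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     weights=[[1, 0], [0, 1]], bias=0.5)
```

Every operation here is an addition or a multiplication, so the output is differentiable with respect to the weights and the bias, which is what allows training by error backpropagation.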

In a DCNN, when data is output from a unit in one layer to a unit in a deeper layer, it is output with a weight W, corresponding to the coupling coefficient of the connection between the units, and a bias component b applied. The learning model performs these computations between units on the input data and outputs the output data from its output layer.

The training dataset in this embodiment is information that associates pixel information, as input, with the element type of each pixel.

During training, the input data of the training dataset is fed to the learning model. The learning model is trained by adjusting its parameters (weight W and bias component b) so that the data output from the output layer (output data) for that input approaches the output in the training dataset.

The parameters of the DCNN model (weight W and bias component b) are adjusted by, for example, error backpropagation. In error backpropagation, the degree of deviation between the data output from the output layer of the learning model and the output in the training dataset is expressed as a loss function. Any measure may be used for the degree of deviation, for example squared error or cross-entropy. Error backpropagation determines (updates) the values of the weight W and the bias component b, working from the output layer toward the input layer, so that the loss function is minimized. This trains the learning model and improves estimation accuracy.
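The parameter update can be illustrated on the smallest possible case: a single unit computing y = w*x + b, trained by gradient descent on the squared error. This is a deliberately simplified sketch, not a DCNN; in a multi-layer network, backpropagation applies the same gradient step to every layer through the chain rule. The function name, learning rate, epoch count, and sample data are hypothetical.

```python
def train_linear_unit(samples, learning_rate=0.1, epochs=200):
    """Fit y = w * x + b to (x, target) pairs by stochastic gradient
    descent on the squared error (y - target) ** 2."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = (w * x + b) - target
            # Gradients of error**2: d/dw = 2*error*x, d/db = 2*error.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return w, b


# Samples generated from target = 2 * x + 1.
trained_w, trained_b = train_linear_unit([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

With these samples the parameters approach w = 2 and b = 1, the values that make the loss zero.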

The learning model is not limited to a DCNN. For example, a CNN, a decision tree, hierarchical Bayes, an SVM (Support Vector Machine), or another technique may be used as the learning model.

The element type estimation unit 13 obtains the output (element type) of the trained model by inputting image information into the trained model. The output of the trained model expresses, as probabilities, the likelihood of each element type, for example "12% likely to be a character element, 80% likely to be a line segment element, and 8% likely to be a background element". Based on the output of the trained model, the element type estimation unit 13 estimates, for example, that the element type with the highest probability at each pixel is the element type of that pixel in the image.
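Selecting the most probable element type for a pixel is a simple argmax, sketched below using the probabilities from the example in the text. The dictionary keys and function name are hypothetical.

```python
def most_likely_type(probabilities):
    """Return the element type with the highest probability."""
    return max(probabilities, key=probabilities.get)


pixel_probabilities = {
    "character": 0.12,
    "line segment": 0.80,
    "background": 0.08,
}
estimated_type = most_likely_type(pixel_probabilities)
print(estimated_type)  # prints "line segment"
```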

The element type estimation unit 13 estimates the element type of each pixel of the scanned image G11 based on the output obtained by inputting the image information of the scanned image G11 into the trained model, and outputs the estimation result to the element type determination unit 14. Likewise, it estimates the element type of each pixel of the enhanced image G12 based on the output obtained by inputting the image information of the enhanced image G12 into the trained model, and outputs that estimation result to the element type determination unit 14.

The description above uses the example in which the element type estimation unit 13 estimates the element type of each pixel by a machine learning technique, but the estimation is not limited to this. The element type estimation unit 13 may estimate the element type of each pixel without machine learning, for example with a rule-based method. In that case, estimation follows rules registered in advance. A rule here specifies a condition associated with an element type; for example, when a predetermined number of pixels with a predetermined grayscale value are consecutive in the horizontal direction, those pixels are treated as line segment elements.
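The example rule above (a horizontal run of pixels with a given grayscale value is a line segment) can be sketched as follows. The function name, the grayscale value 0 for line pixels, and the minimum run length of 5 are hypothetical choices made for this example.

```python
def mark_horizontal_lines(gray, line_value=0, min_run=5):
    """Return a boolean mask that is True for pixels belonging to a
    horizontal run of at least `min_run` pixels equal to `line_value`."""
    height, width = len(gray), len(gray[0])
    is_line = [[False] * width for _ in range(height)]
    for y in range(height):
        x = 0
        while x < width:
            if gray[y][x] != line_value:
                x += 1
                continue
            start = x
            while x < width and gray[y][x] == line_value:
                x += 1  # advance to the end of the current run
            if x - start >= min_run:
                for k in range(start, x):
                    is_line[y][k] = True
    return is_line


line_mask = mark_horizontal_lines([[0, 0, 0, 0, 0, 255, 0, 0]])
```

In the sample row, the first five pixels form a run long enough to count as a line segment, while the two dark pixels at the end do not.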

The element type determination unit 14 determines the element types in the scanned image G11 based on the per-pixel element type estimation results for both the scanned image G11 and the enhanced image G12, obtained from the element type estimation unit 13.

The element type determination unit 14 obtains, for example, the pixel of the emphasized image G12 (the corresponding pixel) that corresponds to a given pixel (the predetermined pixel) in the scan image G11. The relationship between the predetermined pixel and the corresponding pixel may be determined as appropriate for the modulation process (the process corresponding to the predetermined modulation condition). For example, when the modulation process emphasizes edges, the predetermined pixel and the corresponding pixel are the pixels located at the same position coordinates in their respective images (the scan image G11 and the emphasized image G12). When the modulation process moves the pixels in the image by a predetermined distance in the horizontal direction, the vertical direction, or both, the position coordinates of the corresponding pixel are the position coordinates of the predetermined pixel moved by that predetermined distance.
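The two correspondence rules described above can be written as a small coordinate mapping. This is a sketch under stated assumptions: the function name `corresponding_pixel` and the mode strings `"edge"` and `"shift"` are hypothetical, and the handling of pixels that move outside the image is not described in the text and is not handled here.

```python
def corresponding_pixel(x, y, modulation, shift=(0, 0)):
    """Return the coordinates in the modulated image that correspond to
    pixel (x, y) of the target image."""
    if modulation == "edge":
        # Edge emphasis does not move pixels: same coordinates.
        return (x, y)
    if modulation == "shift":
        # Translation: move by the predetermined distance (dx, dy).
        dx, dy = shift
        return (x + dx, y + dy)
    raise ValueError(f"unknown modulation: {modulation!r}")
```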

The element type determination unit 14 determines the element type of the predetermined pixel based on the estimation result for the predetermined pixel and the estimation result for the corresponding pixel. When at least one of the element type of the predetermined pixel and the element type of the corresponding pixel is estimated to be a character element, the element type determination unit 14 determines that the predetermined pixel is a character element. That is, when the predetermined pixel is estimated to be a character element, the element type determination unit 14 determines the predetermined pixel to be a character element regardless of the element type estimated for the corresponding pixel. Likewise, when the corresponding pixel is estimated to be a character element, the element type determination unit 14 determines the predetermined pixel to be a character element regardless of the element type estimated for the predetermined pixel.

When at least one of the element type of the predetermined pixel and the element type of the corresponding pixel is estimated to be a line segment element, the element type determination unit 14 determines that the predetermined pixel is a line segment element. That is, when the predetermined pixel is estimated to be a line segment element, the element type determination unit 14 determines the predetermined pixel to be a line segment element regardless of the element type estimated for the corresponding pixel. Likewise, when the corresponding pixel is estimated to be a line segment element, the element type determination unit 14 determines the predetermined pixel to be a line segment element regardless of the element type estimated for the predetermined pixel.

When both the element type of the predetermined pixel and the element type of the corresponding pixel are estimated to be background elements, the element type determination unit 14 determines that the predetermined pixel is a background element. That is, the element type determination unit 14 determines the predetermined pixel to be a background element only when the predetermined pixel is estimated to be a background element and the corresponding pixel is also estimated to be a background element. The element type determination unit 14 outputs information indicating the element type determined for each pixel in the scan image G11 to the area map generation unit 15.
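The three determination rules above amount to an OR over the two estimates for character and line segment elements and an AND for background elements. The sketch below is illustrative only; the names are hypothetical. The description does not state which type wins when one estimate is a character element and the other is a line segment element, so checking the character element first is an assumption made here, as is the fallback for any other element type.

```python
CHARACTER = "character"
LINE = "line_segment"
BACKGROUND = "background"


def decide_element_type(target_est, corresponding_est):
    """Combine the estimate for a pixel of the target image with the
    estimate for its corresponding pixel in the modulated image."""
    estimates = (target_est, corresponding_est)
    # Assumption: character takes priority over line segment on conflict.
    if CHARACTER in estimates:
        return CHARACTER
    if LINE in estimates:
        return LINE
    if target_est == BACKGROUND and corresponding_est == BACKGROUND:
        return BACKGROUND
    # Assumption: for any other element type, keep the target estimate.
    return target_est
```

Because a foreground estimate from either image is enough, the rule raises detection sensitivity for characters and line segments at the cost of possibly turning some background pixels into foreground.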

The area map generation unit 15 generates an area map based on the information, received from the element type determination unit 14, indicating the element type determined for each pixel in the scan image G11. An area map is a map (image) in which each pixel is associated with an element type. The area map generation unit 15 generates, for example, one area map per element type.
The area map generation unit 15 generates the area map of character elements by setting pixels whose element type is the character element to a specific color (for example, black) and pixels that are not character elements to a different color (for example, white).
The area map generation unit 15 generates the area map of line segment elements by setting pixels whose element type is the line segment element to a specific color (for example, black) and pixels that are not line segment elements to a different color (for example, white).
The area map generation unit 15 generates the area map of background elements by setting pixels whose element type is the background element to a specific color (for example, black) and pixels that are not background elements to a different color (for example, white).
The area map generation unit 15 stores information indicating the generated area maps in a storage unit (not shown).
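Generating one binary area map per element type can be sketched as below. The function names, label strings, and the use of 0 for black and 255 for white are assumptions for illustration; the description only requires two distinguishable colors.

```python
BLACK, WHITE = 0, 255  # assumption: 8-bit grayscale values for the two colors


def region_map(decided, element_type):
    """Binary map: BLACK where the decided type equals `element_type`,
    WHITE everywhere else."""
    return [[BLACK if t == element_type else WHITE for t in row]
            for row in decided]


def region_maps(decided,
                element_types=("character", "line_segment", "background")):
    """One area map per element type, keyed by the element type name."""
    return {t: region_map(decided, t) for t in element_types}
```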

The map information output unit 16 refers to the storage unit in response to, for example, a user operation and outputs information indicating a given area map. The map information output unit 16 may display the area map by outputting the information indicating the area map to a display connected to the region dividing device 10. The map information output unit 16 may also print the area map by outputting the information indicating the area map to a printer connected to the region dividing device 10.

FIG. 2 is a diagram illustrating the processing performed by the region dividing device 10 according to the embodiment.
As shown in FIG. 2, the print image G10 is an image containing, for example, characters such as "あいうえお", "X", and "ABC", together with frame lines formed by combining a plurality of line segments. As this example shows, the print image G10 may contain a mixture of bold and thin characters, and a mixture of characters with different character colors and background colors. Characters may be written inside a frame, and further frame lines may be drawn inside a frame.
The scan image G11 (an example of a "target image") is, for example, an image in which noise was introduced across the whole image during reading by a scanner, so that the parts shown in white in the print image G10 have become light gray and the characters and backgrounds shown in black have become dark gray.

The modulated image generation unit 12 generates the emphasized image G12 by performing a predetermined process (referred to here as the "modulation process") based on the image information of the scan image G11. The emphasized image G12 is, for example, an image in which both the edges of characters and the edges of frame lines are emphasized. In this example, characters written in bold in the scan image G11 are converted into so-called outlined (hollow) characters in the emphasized image G12, regardless of the color of the characters. Characters written in thin strokes in the scan image G11 appear in the emphasized image G12 as they are, following the shape of the characters. Line segments drawn with thick lines in the scan image G11 are converted into what looks like double lines in the emphasized image G12.
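An edge-emphasizing modulation of this kind can be sketched as follows. The description does not specify how edges are detected, so the detector below (a pixel is an edge when it differs from a 4-connected neighbor by more than a threshold) is an assumption, as are the function name and the default values. Only the final step, writing one fixed value for edge pixels and another for all other pixels, follows the text.

```python
def make_edge_image(gray, threshold=64, edge_value=0, non_edge_value=255):
    """Return an image where edge pixels are set to `edge_value` and all
    other pixels to `non_edge_value`.

    Assumption: a pixel counts as an edge if its grayscale value differs
    from any 4-connected neighbor by more than `threshold`."""
    h = len(gray)
    w = len(gray[0]) if h else 0
    out = [[non_edge_value] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and abs(gray[y][x] - gray[ny][nx]) > threshold):
                    out[y][x] = edge_value
                    break
    return out
```

With this detector, the interior of a thick stroke is not an edge and becomes the non-edge value, which matches the hollow appearance of bold characters and the doubled appearance of thick lines described above.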

The element type estimation unit 13 (labeled "region divider (processing)" in FIG. 2) estimates the element type of each pixel in the scan image G11. As the estimation result, the element type estimation unit 13 outputs the estimated area map M10 of the scan image G11. The estimated area map M10 is a map (image) in which the estimated element type is associated with each pixel of the scan image G11. In this way, the element type estimation unit 13 may output the estimation result in the form of a map (image).

The element type estimation unit 13 also estimates the element type of each pixel in the emphasized image G12. As the estimation result, the element type estimation unit 13 outputs the estimated area map M11 of the emphasized image G12. The estimated area map M11 is a map (image) in which the estimated element type is associated with each pixel of the emphasized image G12. Here too, the estimation result may be output in the form of a map (image).

The element type determination unit 14 determines the element type of each pixel of the scan image G11 by combining the estimated area maps M10 and M11. Combining here means the process described above, in which the element type of a predetermined pixel is determined according to the estimation result for the predetermined pixel in the scan image G11 and the estimation result for the corresponding pixel in the emphasized image G12.
The area map generation unit 15 generates the area map M12 of the scan image G11 based on the element type of each pixel of the scan image G11 determined by the element type determination unit 14.

FIG. 3 is a flowchart showing the flow of processing performed by the region dividing device 10 according to the embodiment. The image information acquisition unit 11 of the region dividing device 10 acquires the image information of the scan image G11 (step S10). The modulated image generation unit 12 generates the emphasized image G12 based on the image information of the scan image G11 (step S11). The element type estimation unit 13 estimates the element types of the pixels in the scan image G11 based on the image information of the scan image G11 (step S12). The element type estimation unit 13 estimates the element types of the pixels in the emphasized image G12 based on the image information of the emphasized image G12 (step S13). The element type determination unit 14 determines the element type of each pixel in the scan image G11 based on the estimation results of the element types of the pixels in the scan image G11 and in the emphasized image G12 (step S14). The area map generation unit 15 generates the area map M12 for each element type based on the element type of each pixel in the scan image G11 (step S15).

In the flow described above, the element types of the pixels in the scan image G11 are estimated in step S12 and the element types of the pixels in the emphasized image G12 are then estimated in step S13. However, the order may be reversed: the process of step S12 may be performed after the process of step S13.
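The flow of steps S11 to S15 can be sketched as a single function that takes the individual stages as parameters. All names are hypothetical, and the sketch assumes the same-coordinate correspondence used for edge emphasis; it is not the implementation of the embodiment. The point it illustrates is that the same estimator is applied to both the target image and the modulated image.

```python
def segment(image, modulate, estimate, decide, element_types):
    """Run steps S11 to S15 on an already acquired image (S10)."""
    modulated = modulate(image)            # S11: generate modulated image
    est_target = estimate(image)           # S12: estimate on target image
    est_modulated = estimate(modulated)    # S13: same estimator, modulated image
    # S14: combine the two estimates pixel by pixel (same coordinates assumed).
    decided = [[decide(a, b) for a, b in zip(row_t, row_m)]
               for row_t, row_m in zip(est_target, est_modulated)]
    # S15: one binary map per element type (0 = that type, 255 = otherwise).
    maps = {t: [[0 if d == t else 255 for d in row] for row in decided]
            for t in element_types}
    return decided, maps
```

Since S12 and S13 are independent of each other, they can run in either order, as noted above.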

As described above, the region dividing device 10 of the embodiment includes the image information acquisition unit 11, the modulated image generation unit 12, the element type estimation unit 13, and the element type determination unit 14. The image information acquisition unit 11 acquires the image information of the scan image G11. The modulated image generation unit 12 generates the emphasized image G12, in which the pixel value of each pixel in the scan image G11 is changed according to a predetermined modulation condition. The element type estimation unit 13 estimates the element type of each pixel in an image. The element type determination unit 14 determines the element types of the scan image G11 based on the results of estimating the element type of each pixel in the images. The region dividing device 10 of the embodiment can thereby determine character elements and line segment elements as element types of the scan image G11.

As a comparative example, consider a configuration that determines the element types using only the estimation result for the scan image G11. In general, estimation with a trained model is accurate for inputs that are the same as or similar to the training data set, but it is difficult to estimate accurately for inputs that are not represented in the training data set. Therefore, if the image information of the scan image G11 contains a pixel arrangement pattern that was not in the training data set, the estimation results for the pixels in that pattern and for the pixels around it are likely to be wrong. The estimation results do not change unless the trained model is retrained, so in such a case the estimation accuracy cannot be improved. In other words, it is difficult to estimate accurately in the portions where the image information of the scan image G11 diverges from the content of the training data set. The same applies when rule-based estimation (judgment) results are used instead of a trained model.

In contrast, the region dividing device 10 of the present embodiment determines the element types of the scan image G11 using both the estimation result for the scan image G11 and the estimation result for the emphasized image G12. The emphasized image G12 is an image generated by applying the predetermined modulation process to the scan image G11. As a result, even if the estimation result for a particular pixel in the scan image G11 is wrong, the corresponding pixel in the emphasized image G12 may still be estimated accurately. That is, by using both estimation results, accurate estimation becomes possible even in the portions where the image information of the scan image G11 diverges from the content of the training data set. The sensitivity of semantic region segmentation can thus be improved using only a single machine learner.

In the region dividing device 10 of the present embodiment, the modulated image generation unit 12 changes the pixel value of each pixel in the scan image G11 to a predetermined pixel value depending on whether or not the pixel is an edge. The region dividing device 10 of the embodiment can thereby generate the emphasized image G12, in which the edges of character elements and line segment elements are emphasized. Therefore, even when the scan image G11 contains characters or line segments that are difficult to estimate accurately from the image information of the scan image G11 alone, accurate estimation becomes possible by using the estimation result obtained with the character elements and line segment elements emphasized.

In the region dividing device 10 of the present embodiment, the element type estimation unit 13 estimates the element types of the pixels in an image using a trained model. The trained model is the learning result obtained by having a learning model machine-learn a data set in which the image information of a learning image (an image used for learning) is associated with the element types of the pixels in that learning image. The region dividing device 10 of the present embodiment can thereby estimate element types by the simple method of inputting image information into the trained model.

In the region dividing device 10 of the present embodiment, the element type determination unit 14 determines that the predetermined pixel is a character element when at least one of the element type of the predetermined pixel in the scan image G11 and the element type of the corresponding pixel in the emphasized image G12 is a character element. The element type determination unit 14 determines that the predetermined pixel is a line segment element when at least one of the element type of the predetermined pixel in the scan image G11 and the element type of the corresponding pixel in the emphasized image G12 is a line segment element. The element type determination unit 14 determines that the predetermined pixel is a background element when the element type of the predetermined pixel in the scan image G11 and the element type of the corresponding pixel in the emphasized image G12 are both background elements. Thus, even when the scan image G11 contains line segments or characters that are difficult to estimate accurately from the image information of the scan image G11 alone, the region dividing device 10 of the present embodiment can estimate them accurately by using the estimation result obtained with the line segment elements and character elements emphasized.

All or part of the region dividing device 10 in the embodiment described above may be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, may realize those functions in combination with a program already recorded in the computer system, or may be realized using a programmable logic device such as an FPGA.

An embodiment of the present invention has been described above in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the present invention.

10...Region dividing device
11...Image information acquisition unit (acquisition unit)
12...Modulated image generation unit (generation unit)
13...Element type estimation unit (estimation unit)
14...Element type determination unit (determination unit)
15...Area map generation unit
16...Map information output unit

Claims

A determination device comprising:
an acquisition unit that acquires image information including a pixel value for each pixel in a target image including characters and geometric figures;
a generation unit that generates, based on the image information acquired by the acquisition unit, a modulated image in which the pixel value of each pixel in the target image is changed to a first pixel value when the pixel is an edge and to a second pixel value different from the first pixel value when the pixel is not an edge;
an estimation unit that estimates, using the same region divider regardless of the image to be estimated, an element type that distinguishes whether a pixel in an image is a character element representing an element constituting a character, a geometric element representing an element constituting a geometric figure, or a background element representing an element constituting a background that is neither a character nor a geometric figure; and
a determination unit that determines the element type of each pixel in the target image based on the results, estimated by the estimation unit, of estimating the element type of each pixel in each of the target image and the modulated image.

A determination device comprising:
an acquisition unit that acquires image information including a pixel value for each pixel in a target image including characters and geometric figures;
a generation unit that generates, based on the image information acquired by the acquisition unit, a modulated image in which the pixel coordinates in the target image are moved;
an estimation unit that estimates, using a trained model having a convolution layer, an element type that distinguishes whether a pixel in an image is a character element representing an element constituting a character, a geometric element representing an element constituting a geometric figure, or a background element representing an element constituting a background that is neither a character nor a geometric figure; and
a determination unit that determines the element type of each pixel in the target image based on the results, estimated by the estimation unit, of estimating the element type of each pixel in each of the target image and the modulated image.

The determination device according to claim 1, wherein
the estimation unit estimates the element type of a pixel in an image using a trained model, and
the trained model is a learning result obtained by having a learning model machine-learn a data set in which image information of a learning image, which is an image used for learning, is associated with the element type of each pixel in the learning image.

The determination device according to any one of claims 1 to 3, wherein the determination unit:
determines that the element type of a predetermined pixel in the target image is the character element when at least one of the element type of the predetermined pixel and the element type of a corresponding pixel in the modulated image, which corresponds to the predetermined pixel, is the character element;
determines that the element type of the predetermined pixel is the geometric element when at least one of the element type of the predetermined pixel and the element type of the corresponding pixel is the geometric element; and
determines that the element type of the predetermined pixel is the background element when the element type of the predetermined pixel and the element type of the corresponding pixel are both the background element.

A determination method in which:
an acquisition unit acquires image information including a pixel value for each pixel in a target image including characters and geometric figures;
a generation unit generates, based on the image information acquired by the acquisition unit, a modulated image in which the pixel value of each pixel in the target image is changed to a first pixel value when the pixel is an edge and to a second pixel value different from the first pixel value when the pixel is not an edge;
an estimation unit estimates, using the same region divider regardless of the image to be estimated, an element type that distinguishes whether a pixel in an image is a character element representing an element constituting a character, a geometric element representing an element constituting a geometric figure, or a background element representing an element constituting a background that is neither a character nor a geometric figure; and
a determination unit determines the element type of each pixel in the target image based on the results, estimated by the estimation unit, of estimating the element type of each pixel in each of the target image and the modulated image.

A determination method in which:
an acquisition unit acquires image information including a pixel value for each pixel in a target image including characters and geometric figures;
a generation unit generates, based on the image information acquired by the acquisition unit, a modulated image in which the pixel coordinates in the target image are moved;
an estimation unit estimates, using a trained model having a convolution layer, an element type that distinguishes whether a pixel in an image is a character element representing an element constituting a character, a geometric element representing an element constituting a geometric figure, or a background element representing an element constituting a background that is neither a character nor a geometric figure; and
a determination unit determines the element type of each pixel in the target image based on the results, estimated by the estimation unit, of estimating the element type of each pixel in each of the target image and the modulated image.

A program for causing a computer to operate as the determination device according to any one of claims 1 to 4, the program for causing the computer to function as each unit included in the determination device.