JP2021125040A

JP2021125040A - Document sorting system, document sorting method and program

Info

Publication number: JP2021125040A
Application number: JP2020018985A
Authority: JP
Inventors: 太郎坂本; Taro Sakamoto; 太櫻井; Futoshi Sakurai
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2020-02-06
Filing date: 2020-02-06
Publication date: 2021-08-30
Anticipated expiration: 2040-02-06
Also published as: JP7420578B2

Abstract

To provide a document sorting system, a document sorting method, and a program capable of sorting subspecies so that OCR character recognition can be performed without correcting the definition body used in the OCR character recognition. SOLUTION: A document sorting system comprises: an acquisition unit that acquires a target document, which is a document to be sorted; a type classification unit that classifies the type of the target document using a trained model; a feature extraction unit that extracts features of the target document; and a subspecies classification unit that classifies the target document by performing machine learning using the degree of similarity with the features of a document corresponding to a pre-registered definition body for character recognition, based on the classification result from the type classification unit and the extraction result from the feature extraction unit. SELECTED DRAWING: Figure 1

Description

The present invention relates to a form sorting system, a form sorting method, and a program.

Conventionally, various forms have been used in various industries. For example, text information is obtained by reading a paper form with a reading device such as a scanner and performing character recognition on the resulting image using OCR (Optical Character Reader) technology. This is intended to make clerical work such as data entry more efficient.

Generally, forms are classified into standard forms, semi-standard forms, and non-standard forms. A standard form is a form in which the items and the positions and sizes of the entry frames are defined, and the format is fixed to a single layout. A semi-standard form is a form in which the items are defined but the positions and sizes of the entry frames are not, so that multiple different formats exist. A non-standard form is a form for which no format with fixed items or fixed entry-frame positions and sizes exists.

In other words, as with semi-standard forms, there currently exist multiple forms that are of the same type but differ slightly in format (hereinafter also referred to as subspecies). When such subspecies are mixed together, a form in one format can be recognized by OCR while a form in a slightly different format cannot, and this has hindered the use of OCR for forms.

As a countermeasure, Patent Document 1 discloses a technique in which the form layout of a definition body defined for OCR character recognition is corrected according to the ruled line layout extracted from the target form image, so that character recognition can be performed with a single definition body on a group of forms with similar formats.

Japanese Patent No. 6342292

However, correcting the OCR definition body every time a subspecies form is to be read by OCR takes time and effort. In particular, when character recognition is to be performed on a large number of forms in which subspecies are mixed, applying the technique of Patent Document 1 is inefficient and impractical.

The present invention has been made to solve the above problem, and its object is to provide a form sorting system, a form sorting method, and a program capable of sorting subspecies so that OCR character recognition can be performed without correcting the definition body used for OCR character recognition.

To solve the above problem, one aspect of the present invention is a form sorting system comprising: an acquisition unit that acquires a target form, which is a form to be sorted; a type classification unit that classifies the type of the target form using a trained model; a feature extraction unit that extracts features of the target form; and a subspecies classification unit that classifies the target form by performing machine learning using the degree of similarity with the features of a form corresponding to a pre-registered definition body for character recognition, based on the classification result from the type classification unit and the extraction result from the feature extraction unit, wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

Another aspect of the present invention is a form sorting system comprising: an acquisition unit that acquires a target form, which is a form to be sorted; a feature extraction unit that extracts features of the target form; and a subspecies classification unit that classifies the target form by performing machine learning using the degree of similarity with the features of a form corresponding to a predefined definition body for character recognition, based on the extraction result from the feature extraction unit.

In another aspect of the present invention, in the form sorting system described above, the subspecies classification unit may classify the target form by performing cluster analysis using the degree of similarity between the ruled line features extracted by the feature extraction unit and the ruled line features of the form corresponding to the definition body.

In another aspect of the present invention, in the form sorting system described above, the form corresponding to the definition body may be a form selected from the forms in a cluster obtained by performing the cluster analysis on registration forms.

In another aspect of the present invention, the form sorting system described above may further comprise a conformity determination unit that determines, based on the classification result from the subspecies classification unit, whether a target form classified into the same group as the form corresponding to the definition body is suitable for character recognition using that definition body.

In another aspect of the present invention, in the form sorting system described above, the conformity determination unit may determine whether the target form is suitable for character recognition using the definition body based on the degree of similarity between the ruled line features of the target form classified into the same group as the form corresponding to the definition body and the ruled line features of the form corresponding to the definition body.

Another aspect of the present invention is a form sorting method in which: an acquisition unit acquires a target form, which is a form to be sorted; a type classification unit classifies the type of the target form using a trained model; a feature extraction unit extracts ruled line features of the target form; and a subspecies classification unit classifies the target form by performing machine learning using the degree of similarity with a form corresponding to a predefined definition body for character recognition, based on the classification result from the type classification unit and the extraction result from the feature extraction unit, wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

Another aspect of the present invention is a program for causing a computer to function as: acquisition means for acquiring a target form, which is a form to be sorted; type classification means for classifying the type of the target form using a trained model; feature extraction means for extracting ruled line features of the target form; and subspecies classification means for classifying the target form by performing machine learning using the degree of similarity with a form corresponding to a predefined definition body for character recognition, based on the classification result from the type classification means and the extraction result from the feature extraction means, wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

According to the present invention, subspecies can be sorted so that OCR reading can be performed without correcting the definition body for OCR reading.

A block diagram showing a configuration example of the form recognition system 1 of the embodiment.
A block diagram showing a configuration example of the type classification device 10 of the embodiment.
A block diagram showing a configuration example of the subspecies classification device 20 of the embodiment.
A flow chart showing the flow of processing performed by the type classification device 10 of the embodiment.
A sequence diagram showing the flow of processing performed by the form recognition system 1 of the embodiment.
A flow chart showing the flow of processing performed by the subspecies classification device 20 of the embodiment.
A diagram explaining the processing performed by the subspecies classification device 20 of the embodiment.
A sequence diagram showing the flow of processing performed by the form recognition system 1 of the embodiment.
A flow chart showing the flow of processing performed by the type classification device 10 of the embodiment.
A flow chart showing the flow of processing performed by the subspecies classification device 20 of the embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing a configuration example of the form recognition system 1 of the embodiment. The form recognition system 1 includes, for example, a form sorting system 100 and a character recognition device 30.

The form sorting system 100 is a system that sorts forms of various formats into groups that can be read with the same OCR definition body. Here, an OCR definition body is information about a form that is the target of OCR character recognition, and is used for that recognition. The OCR definition body includes, for example, ruled line layout information indicating the number, length, and arrangement of ruled lines, and form-specific information indicating the form title, item names, and the like.
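The contents of an OCR definition body described above can be pictured as a small data structure. The following is a minimal sketch only: the class names `RuledLine` and `OcrDefinitionBody`, the pixel-coordinate representation, and the sample values are all assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class RuledLine:
    """One ruled line, given by its two endpoints in pixels (hypothetical representation)."""
    start: Tuple[int, int]
    end: Tuple[int, int]

    @property
    def length(self) -> float:
        dx = self.end[0] - self.start[0]
        dy = self.end[1] - self.start[1]
        return (dx * dx + dy * dy) ** 0.5


@dataclass
class OcrDefinitionBody:
    """Form-specific information plus ruled line layout information."""
    title: str                      # form title
    item_names: List[str]           # item names printed on the form
    ruled_lines: List[RuledLine] = field(default_factory=list)

    @property
    def line_count(self) -> int:
        return len(self.ruled_lines)


# Illustrative values only.
definition = OcrDefinitionBody(
    title="Withholding slip",
    item_names=["Name", "Address", "Amount paid"],
    ruled_lines=[RuledLine((0, 0), (300, 0)), RuledLine((0, 0), (0, 400))],
)
print(definition.line_count, definition.ruled_lines[0].length)
```

Keeping the ruled line layout separate from the title and item names mirrors the two kinds of information the text lists, but a real OCR product would define its own format.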

In the present embodiment, a form to be sorted (hereinafter also referred to as a target form) is a document used for entering or printing information required for business, transactions, applications, and the like, such as a withholding slip, a pay statement, various books and slips, or an application form. It is a document in which item columns and entry frames are formed by ruled lines or the like, and which is laid out so that specified content is entered at specified positions.

The form sorting system 100 includes, for example, a type classification device 10 and a plurality of subspecies classification devices 20 (subspecies classification devices 20-1, 20-2, ..., 20-N). N is a natural number determined according to the number of types into which the type classification device 10 classifies forms.

The type classification device 10 is a computer that classifies target forms by type using a machine learning method. The machine learning method used by the type classification device 10 may be any existing learning method; for example, it is supervised learning using a deep learning model such as a CNN (Convolutional Neural Network). When classification is performed using a trained model, how the type classification device 10 classifies forms is determined by what data the trained model has been trained on. The trained model will be described in detail later.

The type classification device 10 classifies target forms into groups that look different at a glance. For example, the type classification device 10 classifies target forms by form title. Alternatively, it classifies target forms by form format. In other words, the type classification device 10 classifies target forms that look the same at a glance into the same type. That is, it does not distinguish between subspecies and classifies them as the same type. Specifically, when the type classification device 10 classifies target forms by title and there are multiple target forms (subspecies) with the same title but slightly different entry-frame positions or sizes, it classifies these subspecies as forms of the same type. The type classification device 10 outputs the classification result to the subspecies classification device 20.

The subspecies classification device 20 is a computer that classifies a group of forms classified into the same type by the type classification device 10 into subspecies. The subspecies classification device 20 acquires the classification result from the type classification device 10. Based on the acquired information, it extracts a feature amount from each form in the group classified into the same type. The feature amount here is a measure of the characteristics of a form needed to classify subspecies, for example the configuration of the ruled lines used in the form (such as the spacing between ruled lines).
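As one illustration of a ruled line feature amount, the spacing between consecutive horizontal ruled lines can be turned into a vector. This is a sketch under stated assumptions: the function name `ruled_line_intervals`, the use of y-coordinates of already detected lines as input, and the normalization by page height are hypothetical choices, since the source does not specify how the feature is computed.

```python
from typing import List


def ruled_line_intervals(y_positions: List[float], page_height: float) -> List[float]:
    """Return the gaps between consecutive horizontal ruled lines.

    Gaps are divided by the page height so that scans at different
    resolutions give comparable values (an assumption of this sketch).
    """
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    ys = sorted(y_positions)
    return [(lower - upper) / page_height for upper, lower in zip(ys, ys[1:])]


# y-coordinates of detected ruled lines, in pixels (illustrative values).
features = ruled_line_intervals([100.0, 300.0, 200.0, 700.0], page_height=1000.0)
print(features)
```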

The subspecies classification device 20 classifies subspecies by performing machine learning using the extracted feature amounts. The machine learning method used by the subspecies classification device 20 may be any existing method; for example, it is unsupervised learning using cluster analysis.

The subspecies classification device 20 classifies subspecies within the range in which character recognition using the same OCR definition body is possible. This makes it less likely that one subspecies can be recognized by OCR while another cannot. It is therefore possible to promote the use of OCR for forms.

The subspecies classification device 20 classifies subspecies based on the degree of similarity between the feature amount of a form corresponding to a pre-registered OCR definition body (hereinafter also referred to as a representative form) and the feature amount of the target form. That is, the subspecies classification device 20 classifies subspecies having features similar to a representative form into the same group as that representative form. A representative form is a form that has been registered in advance and for which an OCR definition body has already been generated so that OCR character recognition can be performed. Consequently, a subspecies classified as having features similar to a representative form is likely to be recognizable by OCR using the OCR definition body corresponding to that representative form. The method by which the subspecies classification device 20 classifies subspecies will be described in detail later. The subspecies classification device 20 outputs the subspecies classification result to the character recognition device 30.
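The idea of grouping a target form with the representative form it most resembles can be sketched as follows. Note the simplification: the source describes machine learning such as cluster analysis, whereas this sketch substitutes a plain nearest-representative lookup using cosine similarity. The function names, the `min_similarity` value of 0.95, and the feature vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_to_representative(
    target: List[float],
    representatives: Dict[str, List[float]],
    min_similarity: float = 0.95,  # hypothetical cut-off
) -> Optional[str]:
    """Return the most similar representative form, or None if none is close enough."""
    best_name, best_score = None, -1.0
    for name, features in representatives.items():
        score = cosine_similarity(target, features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_similarity else None


# Ruled line spacing features of two registered representative forms (illustrative).
representatives = {
    "type1-A": [0.1, 0.1, 0.4],
    "type1-B": [0.3, 0.3, 0.1],
}
print(assign_to_representative([0.11, 0.1, 0.39], representatives))
print(assign_to_representative([0.5, 0.0, 0.0], representatives))
```

Returning `None` for a form that resembles no representative leaves room for handling it separately, for example by registering a new representative form.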

The character recognition device 30 is a computer that performs OCR character recognition. OCR definition bodies corresponding to each of a plurality of representative forms are registered in the character recognition device 30. Based on a representative form designated by the subspecies classification device 20, an operator or the like generates an OCR definition body, and the generated OCR definition body is registered (stored) in the character recognition device 30. If the character recognition device 30 has a function for generating an OCR definition body from a form, it may itself generate the OCR definition body corresponding to the representative form designated by the subspecies classification device 20. The method by which the subspecies classification device 20 designates a representative form will be described in detail later.

The character recognition device 30 acquires the classification result from the subspecies classification device 20. The character recognition device 30 performs OCR character recognition on a subspecies classified as having features similar to a representative form, using the OCR definition body corresponding to that representative form.

The example of FIG. 1 shows a configuration in which the type classification device 10 classifies each target form into one of N types (type 1, type 2, ..., type N). It also shows a configuration in which the group of forms classified into each type is classified into a plurality of subspecies by the corresponding subspecies classification device 20. For example, the subspecies classification device 20-1 classifies forms into a plurality of subspecies (type 1 subspecies A, type 1 subspecies B, ...). The subspecies classification device 20-2 classifies forms into a plurality of subspecies (type 2 subspecies A, type 2 subspecies B, ...). The subspecies classification device 20-N classifies forms into a plurality of subspecies (type N subspecies A, type N subspecies B, ...).
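The two-stage arrangement, one type classifier followed by one subspecies classifier per type, can be sketched as a simple routing function. This is an illustrative sketch only: `sort_forms`, the dictionary-based form records, and the toy classification rules are assumptions standing in for the trained model and the cluster analysis described in the text.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Form = Dict[str, Any]


def sort_forms(
    forms: List[Form],
    classify_type: Callable[[Form], str],
    subspecies_classifiers: Dict[str, Callable[[Form], str]],
) -> Dict[Tuple[str, str], List[Form]]:
    """Group forms by (type, subspecies), routing each type to its own classifier."""
    groups: Dict[Tuple[str, str], List[Form]] = defaultdict(list)
    for form in forms:
        form_type = classify_type(form)                       # stage 1: type
        classify_subspecies = subspecies_classifiers[form_type]
        groups[(form_type, classify_subspecies(form))].append(form)  # stage 2
    return dict(groups)


# Toy stand-ins for the classifiers (hypothetical rules).
forms = [
    {"id": 1, "title": "withholding slip", "rows": 10},
    {"id": 2, "title": "withholding slip", "rows": 12},
    {"id": 3, "title": "payslip", "rows": 8},
]
classifiers = {
    "withholding slip": lambda f: "A" if f["rows"] <= 10 else "B",
    "payslip": lambda f: "A",
}
groups = sort_forms(forms, lambda f: f["title"], classifiers)
print(sorted(groups))
```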

FIG. 2 is a block diagram showing a configuration example of the type classification device 10 of the embodiment. The type classification device 10 includes, for example, a target image acquisition unit 11, a learning image acquisition unit 12, a preprocessing unit 13, a learning unit 14, a prediction unit 15, a type classification unit 16, an output unit 17, and a trained model parameter storage unit 18.

The processing performed by the type classification device 10 has two stages: "advance preparation" and "classification execution". In the advance preparation stage, before target forms are classified by type, the type classification device 10 prepares the trained model to be used for the actual classification. In the classification execution stage, the type classification device 10 performs the actual classification of target forms by type. The two stages are described below in order.

(Advance preparation)
As advance preparation, the type classification device 10 generates a trained model. The trained model is a model that has learned the correspondence between forms used for learning (hereinafter also referred to as learning forms) and their types, so that it can predict the type of an input form it has not yet learned. That is, the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

The learning image acquisition unit 12 acquires image data of learning forms. A learning form is, for example, a form already published as a standard format, or an actual form that has been used in the past. The image data is, for example, electronic information of an image obtained by reading a paper form with a reading device such as a scanner. The learning image acquisition unit 12 acquires, for example, information on images read by a scanner connected to the type classification device 10. Alternatively, the learning image acquisition unit 12 may acquire image data of learning forms stored in an external DB (database) server device via a network or the like. The learning image acquisition unit 12 outputs the acquired image data of the learning forms to the preprocessing unit 13.

The preprocessing unit 13 generates the data set on which the learning model is trained. The learning model is the model before it becomes the trained model, for example a deep (multilayer) model such as a CNN. As the data for training the learning model, the preprocessing unit 13 generates a data set in which learning data and teacher data are associated with each other. The learning data is the data input to the learning model, namely the image data of the learning forms acquired by the learning image acquisition unit 12. The teacher data is the data used to calculate the error of the predicted value output from the learning model, namely information indicating the type of each learning form. The preprocessing unit 13 generates the data set by associating each learning form (learning data) with its type (teacher data). The preprocessing unit 13 outputs the generated data set to the learning unit 14.
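Pairing learning data with teacher data amounts to building a list of (image, label) pairs. The sketch below assumes that images are already available as numeric arrays and that types are encoded as integer indices; `build_dataset`, the type names, and the tiny two-value "images" are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_dataset(
    images: Sequence[List[float]],
    labels: Sequence[str],
    type_names: Sequence[str],
) -> List[Tuple[List[float], int]]:
    """Associate each learning form image with the index of its type."""
    if len(images) != len(labels):
        raise ValueError("each image needs exactly one type label")
    index = {name: i for i, name in enumerate(type_names)}
    unknown = [label for label in labels if label not in index]
    if unknown:
        raise ValueError(f"unknown type labels: {unknown}")
    return [(image, index[label]) for image, label in zip(images, labels)]


type_names = ["final tax return", "request for examination"]
dataset = build_dataset(
    images=[[0.0, 1.0], [1.0, 0.0]],   # stand-ins for scanned image data
    labels=["final tax return", "request for examination"],
    type_names=type_names,
)
print(dataset)
```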

The learning unit 14 trains the learning model using the learning data set generated by the preprocessing unit 13. The learning unit 14 inputs the learning data of the data set into the learning model. Using a technique such as backpropagation, the learning unit 14 adjusts the parameters of the learning model so that the data (predicted value) output from the learning model approaches the teacher data (type) corresponding to that learning data. The learning unit 14 ends the training of the learning model when it determines that a predetermined termination condition is satisfied, for example when the error of the predicted value output from the output layer of the learning model is equal to or less than a predetermined threshold. The learning unit 14 fixes the learning model at the end of training as the trained model. The learning unit 14 stores the parameters set in the learning model at the end of training in the trained model parameter storage unit 18. The parameters here are the variables used to generate the trained model. For example, when the trained model is generated using a CNN learning model, they are information indicating the number of units in each of the input, intermediate, and output layers of the CNN, the number of hidden layers, the activation function, and the like, as well as information indicating the coupling coefficients and weights connecting the nodes of each layer.
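The training procedure (adjust parameters so that predictions approach the teacher data, and stop once the error falls to a threshold) can be sketched with a much smaller model. As a loudly stated substitution, a single-layer softmax classifier trained by gradient descent stands in here for the CNN and backpropagation; the learning rate, the loss threshold of 0.1, and the two-value "images" are assumptions chosen only to keep the example short.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(
    dataset: List[Tuple[List[float], int]],
    n_features: int,
    n_classes: int,
    lr: float = 0.5,
    loss_threshold: float = 0.1,   # termination condition (hypothetical value)
    max_epochs: int = 10000,
) -> Tuple[List[List[float]], float]:
    """Return (weights, mean cross-entropy loss measured before the last update)."""
    weights = [[0.0] * n_features for _ in range(n_classes)]
    loss = float("inf")
    for _ in range(max_epochs):
        grads = [[0.0] * n_features for _ in range(n_classes)]
        loss = 0.0
        for x, label in dataset:
            logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
            probs = softmax(logits)
            loss -= math.log(probs[label])
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. logit c is (prediction - teacher).
                err = probs[c] - (1.0 if c == label else 0.0)
                for j in range(n_features):
                    grads[c][j] += err * x[j]
        loss /= len(dataset)
        if loss <= loss_threshold:
            break  # error is small enough: stop and keep the current weights
        for c in range(n_classes):
            for j in range(n_features):
                weights[c][j] -= lr * grads[c][j] / len(dataset)
    return weights, loss


dataset = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights, final_loss = train(dataset, n_features=2, n_classes=2)
print(final_loss <= 0.1)
```

The returned weights play the role of the parameters that the text says are stored in the trained model parameter storage unit 18.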

The trained model parameter storage unit 18 stores the parameters of the trained model generated by the learning unit 14.

(Classification execution)
In the classification execution stage, the type classification device 10 classifies the type of the target form.

The target image acquisition unit 11 acquires image data of the target form. The method by which the target image acquisition unit 11 acquires the image data of the target form is the same as the method by which the learning image acquisition unit 12 acquires the image data of the learning form, so its description is omitted. The target image acquisition unit 11 outputs the acquired image data of the target form (hereinafter also referred to as the target image) to the prediction unit 15.

The prediction unit 15 predicts the type of the target image. The prediction unit 15 acquires the image data of the target image from the target image acquisition unit 11. By referring to the trained model parameter storage unit 18, the prediction unit 15 acquires (reconstructs) the trained model generated by the learning unit 14. The prediction unit 15 takes the output obtained by inputting the target image into the trained model as the prediction result. The prediction unit 15 outputs the prediction result to the type classification unit 16.

Here, the trained model outputs the prediction result together with its confidence. The confidence is the likelihood of the predicted type, for example information indicating the probability that the form is of the type predicted by the trained model. For example, by using the Softmax function as the activation function of the model, the trained model can be made to output the probability (degree of confidence) of the prediction result. For example, the trained model outputs a prediction result stating that the probability that the target form is type 1 (for example, a final tax return) is 90%. As another example, the trained model outputs a prediction result stating that the probability that the target form is type 1 (for example, a final tax return) is 55% and the probability that it is type 2 (for example, a request for examination) is 40%. The confidence only needs to indicate, at minimum, how likely the predicted type is, and is not limited to a probability. For example, the confidence may be binary information indicating whether the likelihood is "high" or "low", or information indicating one of several levels such as "high", "somewhat high", "somewhat low", and "low".
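The confidence output described above can be illustrated with a minimal Python sketch of the Softmax function. The raw scores and the type names below are illustrative assumptions, not values from the embodiment; the sketch only shows how output-layer scores become probabilities that sum to 1.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three form types (illustrative values only).
TYPE_NAMES = ["final tax return", "request for examination", "other"]
logits = [2.0, 1.0, 0.1]
confidences = softmax(logits)

for name, conf in zip(TYPE_NAMES, confidences):
    print(f"{name}: {conf:.3f}")
```

The type with the largest score receives the largest probability, which can then be used as the confidence of the prediction.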

The type classification unit 16 determines the type of the target form based on the prediction result from the prediction unit 15. For example, among the types whose confidence is equal to or greater than a predetermined threshold, the type classification unit 16 determines the type with the highest confidence to be the type of the target form. When the confidence is less than the predetermined threshold, the type classification unit 16 determines that the type of the target form is unknown. The type classification unit 16 outputs the result of this determination via the output unit 17.
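The threshold rule above can be sketched as follows. The function name `decide_type`, the label strings, and the threshold values are hypothetical; the source states only that the threshold is "predetermined".

```python
UNKNOWN = "unknown"

def decide_type(confidences, threshold=0.5):
    """Return the most confident type, or UNKNOWN if it falls below the threshold.

    confidences: dict mapping a type label to its confidence (0.0 to 1.0).
    threshold:   hypothetical cut-off; the source only says "predetermined".
    """
    if not confidences:
        return UNKNOWN
    best_type = max(confidences, key=confidences.get)
    if confidences[best_type] >= threshold:
        return best_type
    return UNKNOWN

# 90% for type 1 clears the threshold; 55% vs. 40% does not clear a 0.6 threshold.
clear_case = decide_type({"type1": 0.90, "type2": 0.07})
ambiguous_case = decide_type({"type1": 0.55, "type2": 0.40}, threshold=0.6)
print(clear_case, ambiguous_case)
```

Because only the highest confidence is compared against the threshold, a form whose best candidate is weak is routed to the "unknown" path rather than being forced into a type.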

Here, the type classification unit 16 may change the output destination according to the determined type. For example, when the determined type is one that is to be converted into text, the type classification unit 16 outputs the determination result for the target form to the subspecies classification device 20. On the other hand, when the determined type is one that is not to be converted into text, the type classification unit 16 outputs the determination result to another device. The other device is, for example, a database that collects the image data of forms not subject to text conversion.

Further, when the confidence is less than the predetermined threshold, the type classification unit 16 may display a warning to that effect, for example on a display unit (not shown), so that the operator can perceive it. In this way, the type classification unit 16 can notify the operator that there is a target form of unknown type. In response to the warning, the operator can handle the form individually, for example by visually checking the target form of unknown type, or can take a measure such as retraining the trained model. When the trained model is retrained, a data set that associates the target form of unknown type with its type is included in the learning forms. As a result, the retrained model can accurately predict the type of a form that the model before retraining predicted as unknown.

FIG. 3 is a block diagram showing a configuration example of the subspecies classification device 20 of the embodiment. The subspecies classification device 20 includes, for example, a target image acquisition unit 21, a definition body registration image acquisition unit 22, a ruled line extraction unit 23, a similarity calculation unit 24, a subspecies clustering unit 25, a conformity determination unit 26, a subspecies classification unit 27, and an output unit 28.

The processing performed by the subspecies classification device 20 has two stages: "advance preparation" and "classification execution". In the "advance preparation" stage, before the group of target forms classified into the same type is classified by subspecies, the representative forms that serve as the reference for classification, and the OCR definition bodies corresponding to those representative forms, are prepared. In the "classification execution" stage, the subspecies classification device 20 performs the actual classification, in which the group of target forms classified into the same type is classified by subspecies. The two stages are described in order below.

(Advance preparation)
As advance preparation, the subspecies classification device 20 selects representative forms. A representative form is the form from which an OCR definition body, used when performing OCR character recognition, is generated. Taking the representative form as the reference, subspecies with features similar to it are classified into the same group, so that they can be OCR-recognized using the OCR definition body corresponding to that representative form.

The definition body registration image acquisition unit 22 acquires the image data of forms used to register OCR definition bodies (hereinafter also referred to as registration forms). A registration form is, for example, a form that has already been published as a standard format, or an actual form that has been used in the past. The definition body registration image acquisition unit 22 acquires, for example, image information read by a scanner connected to the subspecies classification device 20. Alternatively, the definition body registration image acquisition unit 22 may acquire, via a network or the like, the image data of registration forms stored in an external DB (database) server device. The definition body registration image acquisition unit 22 outputs the acquired image data of the registration forms to the ruled line extraction unit 23.

The ruled line extraction unit 23 extracts ruled lines from the registration form using existing techniques. For example, the ruled line extraction unit 23 extracts ruled lines by applying a Hough transform to the image data of the registration form. Alternatively, the ruled line extraction unit 23 may extract the ruled lines in the registration form by applying a Laplacian filter or a Sobel filter to it. The ruled line extraction unit 23 outputs information indicating the ruled lines extracted from the registration form to the similarity calculation unit 24 in association with that registration form.
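As one illustration of the filter-based alternative mentioned above, the following Python sketch applies a 3x3 Sobel kernel to a tiny synthetic grayscale image and reports the rows where a strong horizontal edge appears. The image, the helper names, and the `min_strength` threshold are assumptions made for this sketch; a practical system would more likely rely on an image processing library and real scans.

```python
def sobel_vertical_gradient(image):
    """Apply the 3x3 Sobel kernel that responds to horizontal edges.

    image: list of rows of grayscale values; border pixels are left at 0.
    """
    kernel = [[-1, -2, -1],
              [0, 0, 0],
              [1, 2, 1]]
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = sum(
                kernel[j][i] * image[y + j - 1][x + i - 1]
                for j in range(3) for i in range(3)
            )
    return out

def rows_with_horizontal_edges(gradient, min_strength=255):
    """Return row indices whose mean absolute response reaches min_strength."""
    rows = []
    for y, row in enumerate(gradient):
        mean_abs = sum(abs(v) for v in row) / len(row)
        if mean_abs >= min_strength:  # hypothetical threshold
            rows.append(y)
    return rows

# Toy 7x8 "scan": white background (255) with one dark ruled line on row 3.
image = [[255] * 8 for _ in range(7)]
image[3] = [0] * 8
gradient = sobel_vertical_gradient(image)
edge_rows = rows_with_horizontal_edges(gradient)
print(edge_rows)
```

The filter responds on the rows immediately above and below the dark line, that is, at the two edges that bracket the ruled line, while the line's own row yields a zero response because the contributions from both sides cancel.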

The similarity calculation unit 24 calculates the similarity between forms based on the features of the ruled lines in the registration forms. The ruled line features are the characteristic aspects of ruled lines that are used for recognition, particularly in OCR character recognition. They are, for example, information indicating the length and number of ruled lines and the position, size, and number of rectangles. The similarity calculation unit 24, for example, quantifies these ruled line features (as a vector representation) and places them in a high-dimensional vector space. The similarity calculation unit 24 calculates the amount of correlation between forms in the vector space onto which the ruled line features are mapped, using the cosine, inner product, distance, or the like. The similarity calculation unit 24 takes the calculated amount of correlation as the similarity between the forms, and outputs the calculated similarity to the subspecies clustering unit 25.
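The cosine-based similarity mentioned above can be sketched in Python as follows. The four-element feature layout and the sample values are hypothetical; they assume each ruled line feature has already been scaled to a comparable range.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: an empty vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical, pre-scaled ruled line features:
# [line count, mean line length, rectangle count, mean rectangle size]
form_a = [0.6, 0.90, 0.4, 0.3]
form_b = [0.6, 0.85, 0.4, 0.3]   # nearly the same layout as form_a
form_c = [0.1, 0.20, 0.9, 0.8]   # a clearly different layout

sim_ab = cosine_similarity(form_a, form_b)
sim_ac = cosine_similarity(form_a, form_c)
print(f"a-b: {sim_ab:.3f}, a-c: {sim_ac:.3f}")
```

Scaling matters here: if one feature has a much larger numeric range than the others, it dominates the cosine and the remaining features barely affect the result.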

The subspecies clustering unit 25 performs cluster analysis based on the similarity between forms calculated by the similarity calculation unit 24. Cluster analysis is a technique for dividing a population in which items of different natures are mixed into several sets whose members have similar properties.

The subspecies clustering unit 25 performs, for example, hierarchical cluster analysis. That is, the subspecies clustering unit 25 does not set the number of groups (the number of clusters) in advance when performing cluster analysis. This is because, in a situation where new subspecies appear almost every year, it is unknown how many OCR definition bodies must be defined, and into how many subspecies the forms must be classified, for OCR recognition of the group of forms to be converted into text to become possible. The subspecies clustering unit 25 outputs the result of the cluster analysis to the conformity determination unit 26 and the subspecies classification unit 27.
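The following Python sketch illustrates hierarchical (agglomerative) clustering in which the number of clusters is not fixed beforehand. Single linkage, Euclidean distance, and the `max_merge_distance` stopping rule are assumptions chosen for brevity; the source does not specify the linkage or the criterion for cutting the hierarchy.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_a, cluster_b, points):
    """Distance between two clusters = distance of their closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)

def agglomerative_clusters(points, max_merge_distance):
    """Merge the closest clusters until the closest pair is farther than the limit."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per form
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_merge_distance:  # hypothetical cut level of the hierarchy
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D feature vectors for six forms.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = agglomerative_clusters(points, max_merge_distance=1.0)
print(clusters)
```

The number of clusters falls out of the data and the chosen cut level, so a newly appearing subspecies simply forms its own cluster instead of being forced into an existing one.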

The conformity determination unit 26 performs anomaly detection on each group produced by the cluster analysis (each group of forms classified as the same subspecies). Anomaly detection here means detecting whether the classified group of forms contains any form whose similarity is extremely low. Among the forms classified into the same group, those with high similarity and mutually similar features are likely to be OCR-recognizable using the same OCR definition body, whereas those with extremely low similarity are considered unlikely to be OCR-recognizable. The conformity determination unit 26 determines whether the forms classified as the same subspecies can be OCR-recognized using the same OCR definition body, that is, whether they conform to the same OCR definition body.

For example, when an anomaly, that is, a form with extremely low similarity, is detected, the conformity determination unit 26 may display a warning to that effect on the display unit. In this way, the conformity determination unit 26 can notify the operator that the forms classified as the same subspecies include a target form with extremely low similarity. In response to the warning, the operator can handle the form individually, for example by visually checking it. The conformity determination unit 26 outputs the determination result to the subspecies classification unit 27.
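One possible way to flag forms with extremely low similarity inside a cluster is sketched below in Python. The rule used here (distance from the component-wise median greater than `factor` times the median distance) is an assumption for illustration; the source does not state how "extremely low" is decided.

```python
import math
from statistics import median

def component_median(vectors):
    """Median of each feature dimension, used as the cluster's center."""
    return [median(column) for column in zip(*vectors)]

def detect_outliers(vectors, factor=3.0):
    """Return indices of vectors unusually far from the cluster center.

    factor is a hypothetical tuning value. If the median distance is 0,
    every vector that is not exactly at the center is flagged.
    """
    center = component_median(vectors)
    distances = [math.dist(v, center) for v in vectors]
    limit = factor * median(distances)
    return [i for i, d in enumerate(distances) if d > limit]

# Hypothetical cluster: four similar forms and one that sits far away.
cluster = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.1), (3.0, 3.0)]
outliers = detect_outliers(cluster)
print(outliers)
```

Using medians rather than means keeps the center and the distance scale from being pulled toward the very outliers the check is meant to find.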

The subspecies classification unit 27 classifies the registration forms by subspecies, using the result of the cluster analysis by the subspecies clustering unit 25 and the result of the anomaly detection by the conformity determination unit 26. From each of the subspecies groups produced by the cluster analysis, the subspecies classification unit 27 removes the forms detected as anomalous, and treats the remaining group as the group of forms classified into the same subspecies. The subspecies classification unit 27 may freely decide which level of the hierarchy to use from the result of the hierarchical cluster analysis by the subspecies clustering unit 25. The subspecies classification unit 27 determines the range that can be read with an OCR definition body according to, for example, the distribution of the forms classified as the same subspecies and the accuracy of OCR recognition. The subspecies classification unit 27 outputs the result of the classification by subspecies to the character recognition device 30 via the output unit 28.

(Classification execution)
In the classification execution stage, the subspecies classification device 20 classifies the target forms by subspecies.

The target image acquisition unit 21 acquires the image data of the target form. The method by which the target image acquisition unit 21 acquires this image data is the same as the method by which the definition body registration image acquisition unit 22 acquires the image data of the registration form, so its description is omitted. The target image acquisition unit 21 outputs the acquired image data of the target image to the ruled line extraction unit 23.

The ruled line extraction unit 23 extracts ruled lines from the target form. The method by which the ruled line extraction unit 23 extracts ruled lines from the target form is the same as the method for the registration form, so its description is omitted. The ruled line extraction unit 23 outputs information indicating the ruled lines extracted from the target form to the similarity calculation unit 24 in association with that target form.

The similarity calculation unit 24 calculates the similarity between the target form and the representative form based on the features of the ruled lines in the target form. The method by which the similarity calculation unit 24 calculates the similarity has already been described, so its description is omitted. The similarity calculation unit 24 outputs the calculated similarity to the subspecies clustering unit 25.

The subspecies clustering unit 25 performs cluster analysis based on the similarity between the target forms calculated by the similarity calculation unit 24. The method by which the subspecies clustering unit 25 performs cluster analysis has already been described, so its description is omitted. The subspecies clustering unit 25 outputs the result of the cluster analysis to the conformity determination unit 26 and the subspecies classification unit 27.

The conformity determination unit 26 performs anomaly detection on each group produced by the cluster analysis (each group of forms classified as the same subspecies). The method by which the conformity determination unit 26 performs anomaly detection has already been described, so its description is omitted. The conformity determination unit 26 outputs the result of the anomaly detection to the subspecies classification unit 27.

The subspecies classification unit 27 classifies the target forms by subspecies, using the result of the cluster analysis by the subspecies clustering unit 25 and the result of the anomaly detection by the conformity determination unit 26. The method by which the subspecies classification unit 27 classifies forms by subspecies has already been described, so its description is omitted. The subspecies classification unit 27 outputs the result of classifying the target forms by subspecies to the character recognition device 30 via the output unit 28.

FIG. 4 is a flow chart showing the flow of processing performed by the type classification device 10 of the embodiment. FIG. 4 shows the flow of processing in which the type classification device 10 generates the trained model in the advance preparation stage.

In the advance preparation stage, the type classification device 10 acquires learning forms (a group of forms for learning) (step S11). Using the learning forms, the type classification device 10 generates a learning data set in which the learning data (the learning forms) are associated with the teacher data (the types) (step S12). The type classification device 10 inputs the learning data (the learning forms) into the learning model (step S13). According to the error between the output obtained from the learning model (the predicted type) and the teacher data (the correct type), the type classification device 10 updates the parameters of the learning model so that the error becomes smaller (step S14). The type classification device 10 determines whether a predetermined end condition is satisfied (step S15). The end condition here is, for example, that the error has fallen below a predetermined threshold, or that the upper limit on the number of training iterations has been reached. When the end condition is satisfied, the type classification device 10 ends the training. When the end condition is not satisfied, the type classification device 10 returns to step S13 and repeats the training.
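The loop of steps S13 to S15 can be sketched in Python as shown below. To stay self-contained, the sketch swaps the CNN of the embodiment for a single-layer softmax classifier trained by gradient descent; the data set, learning rate, error threshold, and iteration limit are all hypothetical values.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_probs(weights, biases, features):
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

def train(dataset, n_types, lr=0.5, error_threshold=0.05, max_epochs=5000):
    """Repeat S13-S14 until an end condition (S15) holds: low error or epoch limit."""
    n_features = len(dataset[0][0])
    weights = [[0.0] * n_features for _ in range(n_types)]
    biases = [0.0] * n_types
    mean_error, epochs = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        total_error = 0.0
        for features, label in dataset:
            probs = predict_probs(weights, biases, features)   # S13: input learning data
            total_error += -math.log(probs[label])             # cross-entropy error
            for k in range(n_types):                           # S14: reduce the error
                grad = probs[k] - (1.0 if k == label else 0.0)
                for i in range(n_features):
                    weights[k][i] -= lr * grad * features[i]
                biases[k] -= lr * grad
        mean_error, epochs = total_error / len(dataset), epoch
        if mean_error < error_threshold:                       # S15: end condition
            break
    return weights, biases, mean_error, epochs

# Hypothetical data set: (feature vector, type index) pairs for two form types.
dataset = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights, biases, mean_error, epochs = train(dataset, n_types=2)
probs_a = predict_probs(weights, biases, [1.0, 0.0])
probs_b = predict_probs(weights, biases, [0.0, 1.0])
print(epochs, round(mean_error, 4))
```

Both end conditions from the text appear in the loop: the error threshold triggers the `break`, and `max_epochs` bounds the number of iterations when the threshold is never reached.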

FIG. 5 is a sequence diagram showing the flow of processing performed by the form recognition system 1 of the embodiment. FIG. 5 shows the flow of processing in which OCR definition bodies are registered in the advance preparation stage.

The subspecies classification device 20 performs cluster analysis (shown as "subspecies clustering") on the registration forms (the forms for definition body registration) (step S21). The subspecies classification device 20 selects a representative form from the forms in each group (cluster) obtained by the cluster analysis (step S22). For example, the subspecies classification device 20 generates, from each group obtained by the cluster analysis, a group from which the forms detected as anomalous have been removed. From that group, the subspecies classification device 20 selects the form with the most common features as the representative form. The form with the most common features is, for example, the form located closest to a representative value (for example, the median) of the group of forms mapped onto the feature vector space. The subspecies classification device 20 outputs the representative form of each group (cluster) to the character recognition device 30. The character recognition device 30 generates an OCR definition body corresponding to the representative form acquired from the subspecies classification device 20, and registers it, for example by storing the generated definition body (step S23).
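Selecting the form closest to the median of its cluster can be sketched in Python as follows. The component-wise median and Euclidean distance are assumptions for this sketch; the source names the median only as one example of a representative value, and the feature vectors are hypothetical.

```python
import math
from statistics import median

def select_representative(vectors):
    """Return the index of the vector closest to the component-wise median."""
    center = [median(column) for column in zip(*vectors)]
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], center))

# Hypothetical cluster after anomalous forms have been removed.
cluster = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.3), (1.1, 1.05), (0.9, 0.9)]
representative_index = select_representative(cluster)
print(representative_index, cluster[representative_index])
```

Choosing an actual form rather than the median point itself matters here, because the OCR definition body has to be generated from a real form image.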

FIG. 6 is a flow chart showing the flow of processing performed by the subspecies classification device 20 of the embodiment. FIG. 6 shows the details of the processing flow corresponding to steps S21 to S22 of FIG. 5.

The subspecies classification device 20 acquires a plurality of registration forms (a group of forms for definition body registration) (step S211), and extracts ruled lines from the image data of each acquired registration form (step S212). The subspecies classification device 20 calculates the similarity of the ruled line features between the registration forms (step S213). The subspecies classification device 20 performs cluster analysis (subspecies clustering) based on the similarity (step S214).

The subspecies classification device 20 performs anomaly detection (conformity determination) on each of the forms classified into the same cluster (step S221). The subspecies classification device 20 selects the representative form of the cluster using the result of the conformity determination (step S222). For example, the subspecies classification device 20 removes from the cluster any form that the anomaly detection (conformity determination) finds to have extremely low similarity compared with the other forms. From the group of forms that remains, the subspecies classification device 20 selects, as the representative form, the form that best exhibits the features common to the group.

FIG. 7 is a diagram illustrating the processing performed by the subspecies classification device 20 of the embodiment. FIG. 7 schematically shows the result of the cluster analysis. The horizontal and vertical axes of FIG. 7 each represent a feature quantity, so FIG. 7 is a two-dimensional feature vector space. As shown in FIG. 7, when the forms are mapped onto the vector space according to their features, they can be divided into several groups according to the distances between them. FIG. 7 shows an example in which the cluster analysis yields five groups, clusters K1 to K5. For example, cluster K2 contains five forms corresponding to points P1 to P5. Of these, the three forms corresponding to points P2 to P4 are close to one another and have high mutual similarity. Point P1, on the other hand, is mapped at some distance from the group of points P2 to P4, so its form is considered dissimilar to, that is, to have low similarity with, the three forms corresponding to points P2 to P4. Likewise, the form at point P5 is considered to have low similarity with those three forms. In this case, the conformity determination unit 26 detects, for example, the forms corresponding to points P1 and P5 among the forms classified into cluster K2 as anomalous forms with extremely low similarity.

FIG. 8 is a sequence diagram showing the flow of processing performed by the form recognition system 1 of the embodiment. FIG. 8 shows the processing flow of the form recognition system 1 in the classification execution stage.

The type classification device 10 classifies the types of the target forms using the trained model (step S30). The type classification device 10 outputs the classification result to the subspecies classification device 20.

For example, the type classification device 10 outputs information indicating each of the target forms classified into type 1 (the type 1 form group) to the subspecies classification device 20-1. The subspecies classification device 20-1 performs cluster analysis on the type 1 form group and classifies it by subspecies (step S31). The subspecies classification device 20-1 outputs the classification result to the character recognition device 30. For example, the subspecies classification device 20-1 outputs to the character recognition device 30 information indicating each of the target forms classified into subspecies A of type 1 (the form group of subspecies A of type 1), and information indicating each of the target forms classified into subspecies B of type 1 (the form group of subspecies B of type 1).

For example, the type classification device 10 outputs information indicating each of the target forms classified into type 2 (the type 2 form group) to the subspecies classification device 20-2. The subspecies classification device 20-2 performs cluster analysis on the type 2 form group and classifies it by subspecies (step S32). The subspecies classification device 20-2 outputs the classification result to the character recognition device 30. For example, the subspecies classification device 20-2 outputs to the character recognition device 30 information indicating each of the target forms classified into subspecies A of type 2 (the form group of subspecies A of type 2), and information indicating each of the target forms classified into subspecies B of type 2 (the form group of subspecies B of type 2).

For example, the type classification device 10 outputs information indicating each of the target forms classified into type N (the type N form group) to the subspecies classification device 20-N. The subspecies classification device 20-N performs cluster analysis on the type N form group and classifies it by subspecies (step S33). The subspecies classification device 20-N outputs the classification result to the character recognition device 30. For example, the subspecies classification device 20-N outputs to the character recognition device 30 information indicating each of the target forms classified into subspecies A of type N (the form group of subspecies A of type N), and information indicating each of the target forms classified into subspecies B of type N (the form group of subspecies B of type N).

For each subspecies acquired from the subspecies classification device 20, the character recognition device 30 performs OCR character recognition on the group of forms classified into that subspecies, using the OCR definition body corresponding to the representative form of that subspecies (step S34).

FIG. 9 is a flow chart showing the flow of processing performed by the type classification device 10 of the embodiment. FIG. 9 shows the details of the processing flow corresponding to step S30 of FIG. 8.

The type classification device 10 acquires the target forms (step S301), and predicts (estimates) the type by inputting the image data of each acquired target form into the trained model (step S302). The type classification device 10 determines whether the confidence of the type predicted by the trained model is equal to or greater than a predetermined threshold (step S303). When the confidence is equal to or greater than the predetermined threshold, the type classification device 10 fixes the predicted type as the type of the target form (step S304). When the confidence is less than the predetermined threshold, the type classification device 10 sets the type of the target form to "other" (type unknown) (step S305).

FIG. 10 is a flow chart showing the flow of processing performed by the type classification device 10 of the embodiment. FIG. 10 shows the details of the processing flow corresponding to step S31 (S32, S33) of FIG. 8. The flow is described here using the processing of step S31 as an example; steps S32 and S33 follow the same flow.

The subspecies classification device 20-1 acquires the target forms for each type (step S311) and extracts ruled lines from the image data of each acquired target form (step S312). The subspecies classification device 20-1 calculates the similarity between each target form and the representative form (step S313) and performs cluster analysis based on the similarity (subspecies clustering) (step S314). The subspecies classification device 20-1 then performs conformity determination (anomaly detection) (step S315) and fixes the forms that were not flagged as anomalous as forms classified into that subspecies (step S316). The subspecies classification device 20 classifies the forms flagged as anomalous as the "other" subspecies (subspecies unknown) (step S317).
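
Steps S313 to S317 can be sketched in simplified form. The following is a sketch under stated assumptions rather than the method of the embodiment: ruled-line features are assumed to be already available as numeric vectors, cosine similarity is an assumed similarity measure, assignment to the most similar representative form stands in for the cluster analysis, and a fixed `min_similarity` threshold stands in for the conformity determination. All names are hypothetical.

```python
import math
from typing import Dict, Sequence

UNKNOWN_SUBSPECIES = "unknown"  # hypothetical label for step S317


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_subspecies(
    features: Dict[str, Sequence[float]],
    representatives: Dict[str, Sequence[float]],
    min_similarity: float = 0.9,
) -> Dict[str, str]:
    """Map each form id to a subspecies name or UNKNOWN_SUBSPECIES."""
    result: Dict[str, str] = {}
    for form_id, vector in features.items():
        # Step S313: similarity to every representative form.
        sims = {name: cosine_similarity(vector, rep)
                for name, rep in representatives.items()}
        if not sims:
            result[form_id] = UNKNOWN_SUBSPECIES
            continue
        # Step S314 (stand-in): join the group of the most similar representative.
        best = max(sims, key=sims.get)
        # Steps S315 to S317 (stand-in): reject forms that are not similar enough.
        result[form_id] = best if sims[best] >= min_similarity else UNKNOWN_SUBSPECIES
    return result


representatives = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
features = {"f1": [0.9, 0.1, 0.0], "f2": [0.1, 1.0, 0.0], "f3": [1.0, 1.0, 1.0]}
labels = classify_subspecies(features, representatives)
```

In this example `f3` resembles neither representative closely, so it is set aside as subspecies unknown instead of being read with a definition body that does not fit it.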

As described above, the form recognition system 1 of the embodiment includes a target image acquisition unit 11, a type classification unit 16, a ruled line extraction unit 23 (an example of a "feature extraction unit"), and a subspecies classification unit 27. The target image acquisition unit 11 acquires the target form. The type classification unit 16 classifies the type of the target form using a trained model. The trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form. The ruled line extraction unit 23 extracts the features of the target form. Based on the classification result by the type classification unit 16 and the extraction result by the ruled line extraction unit 23, the subspecies classification unit 27 classifies the target form by performing cluster analysis (machine learning) using the degree of similarity with the features of the form corresponding to the OCR definition body of the representative form (a pre-registered definition body for character recognition).

As a result, the form recognition system 1 of the embodiment can classify a group of forms resembling the representative form as one subspecies, based on the similarity to the OCR definition body of the representative form. The form group classified as a subspecies can therefore be subjected to OCR character recognition using the OCR definition body corresponding to the representative form, which reduces the number of cases in which character recognition fails. Consequently, subspecies can be sorted so that OCR character recognition is possible without correcting the definition body used for OCR character recognition.

Further, in the form recognition system 1 of the embodiment, the subspecies classification unit 27 classifies the target form by performing cluster analysis using the degree of similarity between the ruled-line features extracted by the ruled line extraction unit 23 and the ruled-line features of the representative form (the form corresponding to the OCR definition body). As a result, the form recognition system 1 of the embodiment can classify forms according to similarity without preparing teacher data, which reduces the effort required for classification.
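
To make the idea of comparing ruled-line features concrete, the following sketch detects ruled lines in a small binary image and compares two forms by the overlap of their line sets. It is only an illustration under assumptions: treating a row or column as a ruled line when most of its pixels are dark, the `min_fill` ratio, and the use of Jaccard similarity are all choices made for this example and are not taken from the embodiment.

```python
from typing import List, Set, Tuple

Line = Tuple[str, int]  # ("h", row index) or ("v", column index)


def extract_ruled_lines(image: List[List[int]], min_fill: float = 0.8) -> Set[Line]:
    """Return rows and columns of a binary image (1 = dark pixel) whose
    share of dark pixels is at least `min_fill`."""
    if not image or not image[0]:
        return set()
    height, width = len(image), len(image[0])
    lines: Set[Line] = set()
    for r, row in enumerate(image):
        if sum(row) / width >= min_fill:
            lines.add(("h", r))
    for c in range(width):
        if sum(image[r][c] for r in range(height)) / height >= min_fill:
            lines.add(("v", c))
    return lines


def ruled_line_similarity(a: Set[Line], b: Set[Line]) -> float:
    """Jaccard similarity of two ruled-line sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


form_a = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
form_b = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
lines_a = extract_ruled_lines(form_a)
lines_b = extract_ruled_lines(form_b)
similarity = ruled_line_similarity(lines_a, lines_b)
```

Because the comparison needs only the line sets of two forms, no labeled training data is involved, which matches the point made above about avoiding teacher data. A real scan would additionally need deskewing and tolerance for small position shifts, which this sketch omits.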

Further, in the form recognition system 1 of the embodiment, the representative form is a form selected from the forms in a cluster obtained by performing cluster analysis on the registration forms. As a result, the form recognition system 1 of the embodiment can select, as the representative form, a form that has the features common to a group of forms classified into the same group as a subspecies. The form group classified into the same group can therefore be subjected to character recognition with the same OCR definition body.
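
One plausible way to pick a representative form from a cluster is to take the medoid, that is, the member most similar on average to all other members. The passage above does not state how the representative is selected, so this rule, the function name `select_representative`, and the toy similarity function are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_representative(cluster: Sequence[T],
                          similarity: Callable[[T, T], float]) -> T:
    """Return the cluster member with the highest total similarity to
    every other member (the medoid)."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    indices = range(len(cluster))
    best_index = max(
        indices,
        key=lambda i: sum(similarity(cluster[i], cluster[j])
                          for j in indices if j != i),
    )
    return cluster[best_index]


# Toy example: forms reduced to one number, similarity = negative distance.
representative = select_representative(
    [1.0, 2.0, 10.0], lambda a, b: -abs(a - b)
)
```

A medoid is always an actual registered form, so an OCR definition body can be authored directly against it.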

Further, the form recognition system 1 of the embodiment also includes a conformity determination unit 26. Based on the classification result by the subspecies classification unit 27, the conformity determination unit 26 determines whether the subspecies classified into the same group as the representative form conforms to character recognition using the OCR definition body corresponding to the representative form. As a result, the form recognition system 1 of the embodiment can remove from the group those forms that are difficult to recognize with the same OCR definition body, which reduces the occurrence of OCR character recognition errors.

Further, in the form recognition system 1 of the embodiment, the conformity determination unit 26 determines conformity based on the degree of similarity between the ruled-line features of a target form classified into the same group as the representative form and the ruled-line features of the representative form. As a result, the form recognition system 1 of the embodiment can flag as anomalous, that is, as non-conforming, any form whose ruled-line features do not resemble those of the representative form, which enables more accurate OCR character recognition.

The embodiment described above illustrates the case in which forms are first classified by type and then classified by subspecies within each type. However, the invention is not limited to this. Forms may be classified directly by subspecies from a mixture of forms of various types, without first classifying them by type. Even with this configuration, except in special cases such as the existence of multiple forms that have identical ruled-line layouts and differ only in their titles, forms can be classified into subspecies that are readable with the OCR definition body corresponding to a representative form selected in advance.

The embodiment described above uses a configuration in which various data (image data of learning forms, image data of registration forms, and so on) are stored in an external DB (database) server device, and the form recognition system 1 acquires the data via a network or the like. The external DB server device in this case may be any computer device; for example, it may be a storage device connected to a network, a so-called NAS (Network Attached Storage). A NAS provides a file system and network communication functions. It is therefore easy to introduce into the form recognition system 1 and easy to expand according to the volume of data to be stored. It also makes it easy for the different kinds of devices in the form recognition system 1 (the type classification device 10, the subspecies classification device 20, the character recognition device 30, and so on) to share the data from each device with one another.

All or part of the form recognition system 1 in the embodiment described above may be implemented by a computer. In that case, a program for implementing the functions may be recorded on a computer-readable recording medium, and the functions may be implemented by having a computer system read and execute the program recorded on the recording medium. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may also include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period, such as volatile memory inside a computer system that serves as the server or client in that case. The program may implement only some of the functions described above, may implement those functions in combination with a program already recorded in the computer system, or may be implemented using a programmable logic device such as an FPGA (Field Programmable Gate Array).

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and includes designs and the like within a range that does not depart from the gist of the present invention.

1 Form recognition system
10 Type classification device
11 Target image acquisition unit
16 Type classification unit
20 Subspecies classification device
21 Target image acquisition unit
23 Ruled line extraction unit
27 Subspecies classification unit
30 Character recognition device

Claims

A form sorting system comprising:
an acquisition unit that acquires a target form, which is a form to be sorted;
a type classification unit that classifies the type of the target form using a trained model;
a feature extraction unit that extracts features of the target form; and
a subspecies classification unit that classifies the target form, based on the classification result by the type classification unit and the extraction result by the feature extraction unit, by performing machine learning using the degree of similarity with the features of a form corresponding to a pre-registered definition body for character recognition,
wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

A form sorting system comprising:
an acquisition unit that acquires a target form, which is a form to be sorted;
a feature extraction unit that extracts features of the target form; and
a subspecies classification unit that classifies the target form, based on the extraction result by the feature extraction unit, by performing machine learning using the degree of similarity with the features of a form corresponding to a predefined definition body for character recognition.

The form sorting system according to claim 1 or 2,
wherein the subspecies classification unit classifies the target form by performing cluster analysis using the degree of similarity between the ruled-line features extracted by the feature extraction unit and the ruled-line features of the form corresponding to the definition body.

The form sorting system according to claim 3,
wherein the form corresponding to the definition body is a form selected from the forms in a cluster obtained by performing the cluster analysis on registration forms.

The form sorting system according to any one of claims 1 to 4,
further comprising a conformity determination unit that determines, based on the classification result by the subspecies classification unit, whether the target form classified into the same group as the form corresponding to the definition body conforms to character recognition using the definition body.

The form sorting system according to claim 5,
wherein the conformity determination unit determines whether the target form conforms to character recognition using the definition body, based on the degree of similarity between the ruled-line features of the target form classified into the same group as the form corresponding to the definition body and the ruled-line features of the form corresponding to the definition body.

A form sorting method in which:
an acquisition unit acquires a target form, which is a form to be sorted;
a type classification unit classifies the type of the target form using a trained model;
a feature extraction unit extracts features of ruled lines in the target form; and
a subspecies classification unit classifies the target form, based on the classification result by the type classification unit and the extraction result by the feature extraction unit, by performing machine learning using the degree of similarity with a form corresponding to a predefined definition body for character recognition,
wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.

A program for causing a computer to function as:
acquisition means for acquiring a target form, which is a form to be sorted;
type classification means for classifying the type of the target form using a trained model;
feature extraction means for extracting features of ruled lines in the target form; and
subspecies classification means for classifying the target form, based on the classification result by the type classification means and the extraction result by the feature extraction means, by performing machine learning using the degree of similarity with a form corresponding to a predefined definition body for character recognition,
wherein the trained model is a model trained so that the output obtained by inputting a learning form approaches the type corresponding to that learning form, and is a model that predicts the type of an input form.