JP2015005100A

JP2015005100A - Information processor, template generation method, and program

Info

Publication number: JP2015005100A
Application number: JP2013129479A
Authority: JP
Inventors: 哲史藤川; Satoshi Fujikawa
Original assignee: Hitachi Systems Ltd
Current assignee: Hitachi Systems Ltd
Priority date: 2013-06-20
Filing date: 2013-06-20
Publication date: 2015-01-08

Abstract

PROBLEM TO BE SOLVED: To generate a template where a character region previously described in a document and a region on which a user can enter characters are segmented with high accuracy.

SOLUTION: A character discrimination section 24 discriminates fixed characters previously described in a document and characters entered by a user from a plurality of pieces of image information of the document. A region discrimination section 25 discriminates an entry region where the user can enter characters on the document, and a non-entry region where the user cannot enter characters on the document on the basis of the fixed characters and the entered characters discriminated by the character discrimination section 24. A generation section 27 generates a template for the document on the basis of the entry region and the non-entry region discriminated by the region discrimination section 25.

Description

The present invention relates to a technology for digitizing forms.

Conventionally, there are systems that digitize paper documents using OCR (Optical Character Reader) or the like. Such a system can, for example, digitize a paper form or generate a digitized form template.
Also conventionally, an information processing system, an information processing apparatus, an information processing method, and a program have been provided that output, for tasks such as setting position information and format information, the position information of an entry area on a form and the format information of metadata relating to that entry area (see, for example, Patent Document 1).

Patent Document 1: JP 2009-238217 A

When a digitized form template is generated from a paper form using OCR or the like, it is desirable that the template separate, with high accuracy, the area of characters pre-printed on the form from the area in which a user can enter characters.

Accordingly, an object of the present invention is to generate a template in which the area of characters pre-printed on a form and the area in which a user can enter characters are separated with high accuracy.

The present application includes a plurality of means for solving at least part of the above problem; one example is as follows. To solve the above problem, an information processing apparatus according to the present invention comprises: a character identification unit that identifies, from image information of a plurality of forms, fixed characters pre-printed on the forms and entered characters written by a user; an area identification unit that identifies, based on the fixed characters and entered characters identified by the character identification unit, an entry area in which the user can enter characters on the form and a non-entry area in which the user cannot enter characters on the form; and a generation unit that generates a template of the form based on the entry area and the non-entry area identified by the area identification unit.

In the above information processing apparatus, the character identification unit may identify a character string as fixed characters when the same character string appears at the same position in a predetermined number of forms of the same type.

In the above information processing apparatus, the area identification unit may identify the area of a character string on the form that the character identification unit identified as fixed characters as a non-entry area, and the area of a character string that the character identification unit identified as entered characters as an entry area.

In the above information processing apparatus, the character identification unit may further use item information, which includes items pre-printed on the form and the entries written by the user for those items, to identify the fixed characters and entered characters on the form.

In the above information processing apparatus, the character identification unit may identify a character string on the form that corresponds to an item in the item information as fixed characters, and a character string on the form that corresponds to an entry in the item information as entered characters.

The above information processing apparatus may further include a storage unit that stores a learning result concerning the identification of character strings on the form that the character identification unit identified as fixed characters, and the character identification unit may use the learning result stored in the storage unit to identify the fixed characters on the form.

In the above information processing apparatus, the learning result may be a probability or score of a character string on the form having been identified as fixed characters by the character identification unit, and the character identification unit may identify a character string on the form as fixed characters when its probability or score is equal to or greater than a predetermined value.

In the above information processing apparatus, the learning result may be a probability or score of a character string on the form having been identified as fixed characters by the character identification unit, and the character identification unit may use either or both of the item information and the probability or score stored in the storage unit to identify the fixed characters and entered characters on the form.

In the above information processing apparatus, the image information of the form may be grouped according to a predetermined scheme, and the apparatus may further include a grouping correction unit that, when fixed characters and entered characters belong to one group, divides the group so that they belong to separate groups.

A template generation method according to the present invention comprises: a character identification step of identifying, from image information of a plurality of forms, fixed characters pre-printed on the forms and entered characters written by a user; an area identification step of identifying, based on the fixed characters and entered characters identified in the character identification step, an entry area in which the user can enter characters on the form and a non-entry area in which the user cannot enter characters on the form; and a generation step of generating a template of the form based on the entry area and the non-entry area identified in the area identification step.

A program for an information processing apparatus according to the present invention causes the information processing apparatus to execute: a character identification step of identifying, from image information of a plurality of forms, fixed characters pre-printed on the forms and entered characters written by a user; an area identification step of identifying, based on the fixed characters and entered characters identified in the character identification step, an entry area in which the user can enter characters on the form and a non-entry area in which the user cannot enter characters on the form; and a generation step of generating a template of the form based on the entry area and the non-entry area identified in the area identification step.

Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiment.

FIG. 1 is a diagram illustrating an overview of an information processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the functional blocks of the information processing apparatus of FIG. 1.
FIG. 3 is a diagram showing a hardware configuration example of the information processing apparatus of FIG. 1.
FIG. 4 is a flowchart showing an operation example of the information processing apparatus of FIG. 1.
FIG. 5 is a diagram showing an example of a form whose data is converted by the data conversion unit.
FIG. 6 is a diagram showing an example of the character string processing TB.
FIG. 7 is a diagram showing an example of the area processing TB.
FIG. 8 is a diagram showing an example of the supplementary data.
FIG. 9 is a diagram showing an example of the character string processing TB to which supplementary data has been added.
FIG. 10 is a diagram illustrating character identification using data of a plurality of forms.
FIG. 11 is a diagram showing an example of the learning information TB.
FIG. 12 is a diagram showing an example of the character string processing TB based on the learning information TB.
FIG. 13 is a diagram showing an example of the character string processing TB.
FIG. 14 is a diagram showing an example of the determination processing TB.
FIG. 15 is a diagram showing an example of the determination processing TB after grouping correction.
FIG. 16 is a diagram illustrating a variable-length area.
FIG. 17 is a diagram showing an example of the determination processing TB for the form of FIG. 16.
FIG. 18 is a diagram showing an example of the determination processing TB for the form of FIG. 16 when it is grouped incorrectly.
FIG. 19 is a diagram showing an example of the determination processing TB after grouping correction.
FIG. 20 is a diagram showing an example of the digitized form.
FIG. 21 is a diagram showing an example of the form template.
FIG. 22 is a diagram showing an example of the template information.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a diagram illustrating an overview of an information processing apparatus according to an embodiment of the present invention. In addition to the information processing apparatus 1, FIG. 1 shows a form 11 and supplementary data 12 that are input to the information processing apparatus 1. FIG. 1 also shows a digitized form 13, a form template 14, and template information 15 generated by the information processing apparatus 1.

The form 11 is, for example, a form such as a transportation expense claim slip or a delivery slip. The form 11 is, for example, a paper form.

The supplementary data 12 is item information about the items described on the form 11. For example, the supplementary data 12 includes the items pre-printed on the form 11 and the item contents that the user entered for those items.

Specifically, an item included in the supplementary data 12 is a character string pre-printed on the form 11, such as "application date", "applicant", or "application cost". The item content is, for example, the date the user entered for the item "application date", the user's name entered for the item "applicant", or the amount of transportation expenses the user entered for the item "application cost".

The supplementary data 12 is written by the user in, for example, XML (Extensible Markup Language). Alternatively, the supplementary data 12 may be a CSV (Comma Separated Values) file, a text file, or the like.

In the following, an item pre-printed on the form 11 may be referred to as fixed characters, and the item content that the user entered on the form 11 for an item may be referred to as entered characters.

The information processing apparatus 1 reads a plurality of forms 11 using, for example, an OCR function. Supplementary data 12 corresponding to the forms 11 that were read is also input to the information processing apparatus 1. Based on the forms 11 that were read and the supplementary data 12 corresponding to each of them, the information processing apparatus 1 generates the digitized form 13, the form template 14, and the template information 15.

The digitized form 13 is a digitized version of the form 11. The digitized form 13 is generated as, for example, a document file or an image file. Replacing the paper form 11 with the digitized form 13 allows the paper form 11 to be managed electronically.

The form template 14 is a digitized template of the form 11. For example, the form template 14 is a file in which the user's entered characters have been removed from the form 11 and the areas of the removed characters have been made editable by the user. The form template 14 can be displayed on, for example, the display of a terminal device, and the user can fill in the required information in the form template 14 on the terminal device.

The template information 15 is the format information of the form template 14. The template information 15 includes format information such as the character strings, positions, formats, page margins, paper size, and ruled-line formats described in the form template 14. The template information 15 can be used, for example, together with the form template 14 as a resource for developing systems related to the form 11.

In the above description, the information processing apparatus 1 generates the digitized form 13, the form template 14, and the template information 15 based on a plurality of forms 11 and the supplementary data 12. However, it can also generate them based on the plurality of forms 11 alone. That is, even when no supplementary data 12 is input, the information processing apparatus 1 can generate the digitized form 13, the form template 14, and the template information 15 from the plurality of forms 11.

FIG. 2 is a diagram showing an example of the functional blocks of the information processing apparatus of FIG. 1. As shown in FIG. 2, the information processing apparatus 1 has a data conversion unit 21, an input unit 22, a character correction unit 23, a character identification unit 24, an area identification unit 25, a grouping correction unit 26, a generation unit 27, and a storage unit 28.

The data conversion unit 21 recognizes the characters described on the form 11. For example, the data conversion unit 21 has an AD (Analog-Digital) conversion function; it scans information such as the character strings and ruled lines on the form 11 and converts it into digital data.

The data conversion unit 21 also has an OCR function and recognizes, in the form 11 converted into digital data, the characters, their positions, and character formats such as font type and size. In addition, the data conversion unit 21 has a general-purpose image recognition function; it recognizes the page margins, paper size, paper orientation, ruled lines, ruled-line positions, and ruled-line formats of the form 11, and groups areas using the ruled lines as a reference.

The supplementary data 12 is input to the input unit 22. The information processing apparatus 1 can generate a good-quality digitized form 13, form template 14, and template information 15 even when no supplementary data 12 is input to the input unit 22, but it can generate them with higher quality when the supplementary data 12 is input.

When the data conversion unit 21 misrecognizes a character, the character correction unit 23 corrects it to the right character. The character correction unit 23 detects and corrects misrecognized characters using the data of the plurality of OCR-processed forms 11. Alternatively, when the supplementary data 12 has been input to the input unit 22, the character correction unit 23 uses that supplementary data 12 to detect and correct misrecognized characters.

The character identification unit 24 identifies whether a character string on the OCR-processed form 11 is fixed characters (an item of the form 11) or entered characters written by the user (item content of the form 11). The character identification unit 24 can distinguish fixed characters from entered characters by, for example, the following three methods. It can do so with method 1 alone, but it may also use all of methods 1 to 3, or method 1 with method 2, or method 1 with method 3.

1. Identification using data of a plurality of forms 11
Fixed characters on the form 11 generally appear as the same characters at the same position across a plurality of forms 11. Entered characters, by contrast, are written by users, so their content may differ from one form 11 to another. The character identification unit 24 can therefore distinguish fixed characters from entered characters using the data of a plurality of forms 11.

For example, when the same character string appears at the same position across a predetermined number of forms 11 of the same type, the character identification unit 24 identifies the character string at that position as fixed characters. When different character strings appear at the same position across a predetermined number of forms 11 of the same type, the character identification unit 24 identifies the character string at that position as entered characters. The predetermined number need only be two or more, for example, but the larger it is, the better.
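Method 1 can be illustrated with a minimal Python sketch. It is not the implementation described in this publication: the function name `classify_by_position`, the parameter `min_forms` (standing in for the predetermined number), the exact-match position keys, and the "unknown" label for positions seen on too few forms are all assumptions made for illustration. A practical system would likely need a tolerance when comparing OCR positions.

```python
from collections import defaultdict


def classify_by_position(forms, min_forms=2):
    """Label each position as "fixed", "entered", or "unknown".

    forms: list of dicts, one per scanned form of the same type, each
    mapping a position key (hypothetical, e.g. an (x, y) tuple) to the
    character string that OCR found there.
    """
    strings_at = defaultdict(list)
    for form in forms:
        for position, text in form.items():
            strings_at[position].append(text)

    labels = {}
    for position, texts in strings_at.items():
        if len(texts) < min_forms:
            # Assumption: too few forms to decide either way.
            labels[position] = "unknown"
        elif len(set(texts)) == 1:
            # Same string at the same position on every form: fixed.
            labels[position] = "fixed"
        else:
            # Different strings at the same position: entered by users.
            labels[position] = "entered"
    return labels
```

With two forms that both show "Application date" at one position and different dates at another, the first position would be labeled "fixed" and the second "entered".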

2. Identification by learning
The character identification unit 24 learns the fixed characters it has recognized and uses that learning to distinguish the fixed characters from the entered characters on the form 11.

For example, the character identification unit 24 stores, in the storage unit 28, a learning result concerning the identification of character strings on forms that it identified as fixed characters. More specifically, the character identification unit 24 stores in the storage unit 28 a learning table that records the probability with which an OCR-processed character string was identified as fixed characters in the past. The character identification unit 24 refers to the learning table stored in the storage unit 28 and identifies whether a character string on the OCR-processed form 11 is fixed characters or entered characters.
Although the learning of fixed characters is described above in terms of a probability, a score may be used instead, because the learning result obtained by some learning algorithms is expressed as a score. The same applies to the learning described below.
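The learning table of method 2 could be sketched as follows. This is an illustrative assumption, not the structure of the learning information TB in this publication: the class `LearningTable`, its method names, the default threshold of 0.8, and the choice to label below-threshold strings as "entered" and never-seen strings as "unknown" are all hypothetical.

```python
class LearningTable:
    """Tracks how often each string was identified as fixed characters."""

    def __init__(self):
        # text -> (times identified as fixed, times seen); hypothetical layout
        self._counts = {}

    def record(self, text, was_fixed):
        fixed, seen = self._counts.get(text, (0, 0))
        self._counts[text] = (fixed + (1 if was_fixed else 0), seen + 1)

    def probability(self, text):
        """Past probability of being fixed, or None if never seen."""
        fixed, seen = self._counts.get(text, (0, 0))
        return fixed / seen if seen else None

    def classify(self, text, threshold=0.8):
        p = self.probability(text)
        if p is None:
            return "unknown"  # assumption: no history, no decision
        # At or above the predetermined value: treat as fixed characters.
        return "fixed" if p >= threshold else "entered"
```

A score produced by a learning algorithm could replace the probability here by storing the score directly and comparing it against the threshold in the same way.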

3. Identification using the supplementary data 12
As described with reference to FIG. 1, the supplementary data 12 contains information on the items (fixed characters) and the item contents (entered characters) of the form 11. The character identification unit 24 refers to the supplementary data 12, which includes this information on fixed characters and entered characters, and identifies whether an OCR-processed character string is fixed characters or entered characters.

For example, the character identification unit 24 compares the character strings of the OCR-processed form 11 with the supplementary data 12 that was input. When a character string of the OCR-processed form 11 matches an item in the supplementary data 12, the character identification unit 24 identifies that character string as fixed characters. When a character string of the OCR-processed form 11 matches an item content in the supplementary data 12, the character identification unit 24 identifies that character string as entered characters.
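The comparison in method 3 can be sketched in Python as below. The sketch assumes the supplementary data 12 has already been parsed (from XML, CSV, or text) into a dictionary mapping each item to its item content; the function name `classify_with_supplement`, that dictionary shape, and the "unknown" label for strings that match neither side are assumptions for illustration only.

```python
def classify_with_supplement(ocr_strings, supplement):
    """Label OCR strings by matching them against supplementary data.

    ocr_strings: character strings recognized on one form.
    supplement: dict mapping item (pre-printed label) to item content
                (what the user wrote); a hypothetical parsed form of
                the supplementary data.
    """
    items = set(supplement.keys())
    contents = set(supplement.values())

    labels = {}
    for text in ocr_strings:
        if text in items:
            labels[text] = "fixed"      # matches an item
        elif text in contents:
            labels[text] = "entered"    # matches an item content
        else:
            labels[text] = "unknown"    # assumption: no match, no decision
    return labels
```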

The character identification unit 24 identifies fixed characters and entered characters by each of the three methods above and integrates the three results to make the final identification. The character identification unit 24 need not integrate all three results; it may make the identification from the result of method 1 alone, from the integrated results of method 1 and method 2, or from the integrated results of method 1 and method 3.

Based on the fixed characters and entered characters identified by the character identification unit 24, the area identification unit 25 identifies the entry areas of the form 11, in which the user can write, and the non-entry areas, in which the user cannot write.

For example, the area identification unit 25 identifies the area of a character string on the form 11 that the character identification unit 24 identified as fixed characters as a non-entry area. The area identification unit 25 identifies the area of a character string on the form 11 that the character identification unit 24 identified as entered characters as an entry area.
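The mapping from identified character strings to areas can be sketched as follows. The `TextBox` record, its field names, and the rectangle representation `(x, y, width, height)` are hypothetical; the publication does not specify how areas are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """One OCR'd character string with its bounding box (hypothetical)."""
    text: str
    x: int
    y: int
    width: int
    height: int
    kind: str  # "fixed" or "entered", as decided by character identification


def identify_areas(boxes):
    """Return (entry_areas, non_entry_areas) as lists of rectangles."""
    entry_areas, non_entry_areas = [], []
    for box in boxes:
        rect = (box.x, box.y, box.width, box.height)
        if box.kind == "entered":
            entry_areas.append(rect)       # user can write here
        elif box.kind == "fixed":
            non_entry_areas.append(rect)   # pre-printed, not writable
        else:
            raise ValueError(f"unexpected kind: {box.kind!r}")
    return entry_areas, non_entry_areas
```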

The grouping correction unit 26 corrects the grouping performed by the data conversion unit 21.

For example, as described above, the data conversion unit 21 groups areas using the ruled lines as a reference. In the grouping performed by the data conversion unit 21, fixed characters and entered characters may end up in the same group. In that case, the grouping correction unit 26 divides the group so that the fixed characters and the entered characters are in separate groups. In other words, the grouping correction unit 26 ensures that, in the generated form template 14, the areas in which the user can enter characters are separate from the areas in which the user cannot.
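The group-splitting step can be sketched in Python as below. The input shape (a dictionary from a string group ID to a list of `(text, kind)` pairs) and the derived IDs with `-fixed` and `-entered` suffixes are assumptions for illustration; the publication does not describe how divided groups are named or stored.

```python
def correct_grouping(groups):
    """Split any group that mixes fixed and entered character strings.

    groups: dict mapping a group ID (str) to a list of (text, kind)
            pairs, where kind is "fixed" or "entered" (hypothetical shape).
    """
    corrected = {}
    for group_id, members in groups.items():
        fixed = [m for m in members if m[1] == "fixed"]
        entered = [m for m in members if m[1] == "entered"]
        if fixed and entered:
            # Mixed group: divide it so each kind has its own group.
            corrected[f"{group_id}-fixed"] = fixed
            corrected[f"{group_id}-entered"] = entered
        else:
            # Already homogeneous: keep the group unchanged.
            corrected[group_id] = list(members)
    return corrected
```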

The generation unit 27 generates the digitized form 13 of the form 11 based on the data converted by the data conversion unit 21.

The generation unit 27 also generates the form template 14 of the form 11 based on the data converted by the data conversion unit 21. In doing so, the generation unit 27 deletes the character strings in the entry areas of the form 11, based on the entry areas and non-entry areas identified by the area identification unit 25, to generate the form template 14.
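Template generation can be sketched as clearing the text of every entry area while leaving fixed characters in place. The dictionary keys (`text`, `kind`, `editable`) and the function name `build_template` are hypothetical; the `editable` flag reflects the earlier description of the form template 14 as a file whose cleared areas the user can edit.

```python
def build_template(boxes):
    """Return template fields derived from identified character strings.

    boxes: list of dicts with at least "text" and "kind" keys, where
           kind is "fixed" or "entered" (hypothetical representation).
    The input list and its dicts are not modified.
    """
    template = []
    for box in boxes:
        if box["kind"] == "entered":
            # Entry area: delete the user's text and mark it editable.
            template.append({**box, "text": "", "editable": True})
        else:
            # Non-entry area: keep the pre-printed text, not editable.
            template.append({**box, "editable": False})
    return template
```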

The generation unit 27 also generates the template information 15 of the form template 14 that it generates. That is, the generation unit 27 generates the format information of the form template 14.

The storage unit 28 temporarily stores data TBs (TB: Table). When executing processing, each unit of the information processing apparatus 1 creates a data TB in the storage unit 28 or refers to a data TB that has been created.

Next, a hardware configuration example for realizing the functions of the functional blocks shown in FIG. 2 will be described.

FIG. 3 is a diagram showing a hardware configuration example of the information processing apparatus of FIG. 1. As shown in FIG. 3, the information processing apparatus 1 has a CPU (Central Processing Unit) 31, a RAM (Random Access Memory) 32, a ROM (Read Only Memory) 33, an HDD (Hard Disk Drive) 34, a scanner 35, a drive 36, a communication interface 37, and a bus 38.

The CPU 31 controls the information processing apparatus 1 as a whole. The RAM 32, the ROM 33, the HDD 34, the scanner 35, the drive 36, and the communication interface 37 are connected to the CPU 31 via the bus 38.

The RAM 32 and the ROM 33 temporarily store at least part of the OS (Operating System) program executed by the CPU 31 and of the application program for generating the digitized form 13, the form template 14, and the template information 15. The RAM 32 also stores various data needed for processing by the CPU 31.

The ROM 33 and the HDD 34 store the OS, application programs, and various data.

スキャナ３５は、光や磁気などを用いて情報を読み取る装置である。例えば、スキャナ３５は、帳票１１に記載されている文字や罫線などを、光によって読み取る。 The scanner 35 is a device that reads information using light or magnetism. For example, the scanner 35 reads characters, ruled lines, and the like described in the form 11 with light.

ドライブ３６は、外部の記憶媒体にデータを書き込んだり、外部の記憶媒体に記憶されているデータを読み出したりする装置である。ドライブ３６は、例えば、ＣＤ（Compact Disc）ドライブやＤＶＤ（Digital Versatile Disc）ドライブなどである。 The drive 36 is a device that writes data to an external storage medium and reads data stored in an external storage medium. The drive 36 is, for example, a CD (Compact Disc) drive or a DVD (Digital Versatile Disc) drive.

通信インタフェース３７は、ネットワークに接続され、ＣＰＵ３１の指示に応じて、外部の装置と通信を行う。通信インタフェース３７は、例えば、ＬＡＮ（Local Area Network）アダプタである。 The communication interface 37 is connected to a network and communicates with an external device in accordance with an instruction from the CPU 31. The communication interface 37 is, for example, a LAN (Local Area Network) adapter.

The configuration of the information processing apparatus 1 shown in FIGS. 2 and 3 is classified according to the main processing contents in order to facilitate understanding. The present invention is not limited by the way the components are classified or named. For example, the configuration of the information processing apparatus 1 can be divided into a larger number of components according to the processing contents. It can also be divided so that a single component performs more processes. The processing of each component may be executed by one piece of hardware or by a plurality of pieces of hardware, and may be realized by one program or by a plurality of programs.

上記の情報処理装置１の処理機能は、コンピュータによって実現することができる。その場合、情報処理装置１が有する機能の処理内容を記述したプログラムが提供される。そのプログラムをコンピュータで実行することにより、上記処理機能がコンピュータ上で実現される。処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、磁気記憶装置、光ディスク、光磁気記録媒体、半導体メモリ等が挙げられる。磁気記憶装置には、ハードディスクドライブ、フレキシブルディスク（ＦＤ）、磁気テープ等が挙げられる。光ディスクには、Ｂｌｕ−ｒａｙ（登録商標）、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）／ＲＷ（ReWritable）等が挙げられる。光磁気記録媒体には、ＭＯ（Magneto-Optical disk）等が挙げられる。 The processing function of the information processing apparatus 1 can be realized by a computer. In that case, a program describing the processing contents of the functions of the information processing apparatus 1 is provided. By executing the program on a computer, the above processing functions are realized on the computer. The program describing the processing contents can be recorded on a computer-readable recording medium. Examples of the computer-readable recording medium include a magnetic storage device, an optical disk, a magneto-optical recording medium, and a semiconductor memory. Examples of the magnetic storage device include a hard disk drive, a flexible disk (FD), and a magnetic tape. Examples of the optical disc include Blu-ray (registered trademark), DVD (Digital Versatile Disc), DVD-RAM, CD-ROM (Compact Disc Read Only Memory) / RW (ReWritable), and the like. Examples of the magneto-optical recording medium include an MO (Magneto-Optical disk).

プログラムを流通させる場合には、例えば、そのプログラムが記録されたＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体が販売される。また、プログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することもできる。 When distributing the program, for example, a portable recording medium such as a DVD or a CD-ROM in which the program is recorded is sold. It is also possible to store the program in a storage device of a server computer and transfer the program from the server computer to another computer via a network.

プログラムを実行するコンピュータは、例えば、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、自己の記憶装置に格納する。そして、コンピュータは、自己の記憶装置からプログラムを読み取り、プログラムに従った処理を実行する。なお、コンピュータは、可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することもできる。また、コンピュータは、ネットワークを介して接続されたサーバコンピュータからプログラムが転送される毎に、逐次、受け取ったプログラムに従った処理を実行することもできる。 The computer that executes the program stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, the computer reads the program from its own storage device and executes processing according to the program. The computer can also read the program directly from the portable recording medium and execute processing according to the program. In addition, each time a program is transferred from a server computer connected via a network, the computer can sequentially execute processing according to the received program.

また、上記の処理機能の少なくとも一部または全部を、ＤＳＰ（Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、またはＰＬＤ（Programmable Logic Device）等の電子回路で実現することもできる。 Further, at least a part or all of the above processing functions can be realized by an electronic circuit such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a PLD (Programmable Logic Device).

情報処理装置１の動作について、フローチャートを用いて説明する。 The operation of the information processing apparatus 1 will be described using a flowchart.

FIG. 4 is a flowchart illustrating an operation example of the information processing apparatus of FIG. 1. The processing in the flowchart of FIG. 4 is started, for example, in response to an instruction from the user.

［ステップＳ１］データ変換部２１は、紙媒体の帳票１１を電子データに変換し、以下で説明する文字列処理ＴＢと領域処理ＴＢとを生成する。文字列処理ＴＢと領域処理ＴＢとを説明する前に、電子データに変換される帳票１１について説明する。 [Step S1] The data converter 21 converts the paper medium form 11 into electronic data, and generates character string processing TB and area processing TB described below. Before describing the character string process TB and the area process TB, the form 11 converted into electronic data will be described.

図５は、データ変換部でデータ変換される帳票の例を示した図である。図５に示す帳票１１は、交通費申請書の例を示している。 FIG. 5 is a diagram showing an example of a form whose data is converted by the data conversion unit. A form 11 shown in FIG. 5 shows an example of a transportation expense application form.

The strings “Transportation expense application form”, “Application date”, “Application name”, “Applicant”, “Application cost”, and “Reason:” shown in FIG. 5 are printed on the form 11 in advance. That is, these strings are items and are fixed characters.

The strings “October 10, Heisei 24 (2012)”, “Application for ○○ Prefecture inspection transportation expenses”, “○○ Taro”, “○○ Jiro”, “¥1,000”, and “For inspection of ○○ Prefecture” shown in FIG. 5 are characters that the user entered on the form 11. That is, these strings are item contents and are entry characters. The data conversion unit 21 converts a form 11 such as the one shown in FIG. 5 into data.

文字列処理ＴＢについて説明する。データ変換部２１は、帳票１１の文字列を認識し、各種書式情報の取得を行う。例えば、データ変換部２１は、ＯＣＲ機能によって、帳票１１の文字列を認識し、文字列の位置やフォント種別、サイズなどの文字書式を取得する。データ変換部２１は、取得した文字列の情報に基づいて、文字列処理ＴＢを生成する。 The character string processing TB will be described. The data conversion unit 21 recognizes the character string of the form 11 and acquires various format information. For example, the data conversion unit 21 recognizes the character string of the form 11 by the OCR function, and acquires the character format such as the position, font type, and size of the character string. The data conversion unit 21 generates a character string process TB based on the acquired character string information.

FIG. 6 is a diagram illustrating an example of the character string processing TB. For example, the data conversion unit 21 recognizes the character strings of the form 11 shown in FIG. 5 and generates a character string processing TB 41 as shown in FIG. 6 in the storage unit 28.

文字列処理ＴＢ４１の「Ｎｏ．」の欄には、認識した文字列に付与した番号が格納される。 The number assigned to the recognized character string is stored in the “No.” field of the character string processing TB41.

「文字列」の欄には、帳票１１から認識した文字列が格納される。例えば、図５に示した帳票１１には、「申請名称」や「○○県視察交通費申請」という文字列が記載されている。従って、文字列処理ＴＢ４１の欄には、これらの文字列が格納されている。 The character string recognized from the form 11 is stored in the “character string” column. For example, in the form 11 shown in FIG. 5, a character string “application name” or “application for XX prefecture inspection transportation expenses” is described. Therefore, these character strings are stored in the column of the character string processing TB41.

「位置」の欄には、認識した文字列の位置情報が格納される。「位置」の欄は、さらに「Ｙ」の欄と「Ｘ」の欄とに分かれている。「Ｙ」の欄には、認識した文字列の垂直方向における位置情報が格納される。「Ｘ」の欄には、認識した文字列の水平方向における位置情報が格納される。 The “position” column stores position information of the recognized character string. The “position” column is further divided into a “Y” column and an “X” column. In the “Y” column, position information in the vertical direction of the recognized character string is stored. In the “X” column, position information in the horizontal direction of the recognized character string is stored.
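The layout of the character string processing TB described above can be sketched as a small record type. This is only an illustrative sketch: the names `StringRow` and `build_string_table`, the tuple format of the OCR results, and the rule that numbers are assigned in top-to-bottom, left-to-right order are assumptions, since the description does not say how the numbers are assigned.

```python
from dataclasses import dataclass

@dataclass
class StringRow:
    no: int    # number assigned to the recognized string ("No." column)
    text: str  # recognized character string ("character string" column)
    y: int     # vertical position ("Y" column)
    x: int     # horizontal position ("X" column)

def build_string_table(ocr_results):
    """Build table rows from (text, y, x) tuples.

    Assumption: rows are numbered top to bottom, then left to right.
    """
    ordered = sorted(ocr_results, key=lambda r: (r[1], r[2]))
    return [StringRow(no=i, text=t, y=y, x=x)
            for i, (t, y, x) in enumerate(ordered, start=1)]

# Hypothetical OCR output for two strings on the same line of the form.
table = build_string_table([
    ("Application for inspection transportation expenses", 20, 50),
    ("Application name", 20, 10),
])
```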

領域処理ＴＢについて説明する。データ変換部２１は、画像識別処理機能を備え、帳票１１のページ余白、用紙サイズ、用紙向き、罫線、罫線位置、罫線書式を認識し、また、罫線を基準とした領域のグルーピング処理を行う。グルーピングについては、多様な手法が存在するが、ここでは、データ変換部２１は、例えば、罫線に囲まれた領域を１領域とグルーピングする。データ変換部２１は、前記のグルーピング処理に基づいて、領域処理ＴＢを生成する。 The area process TB will be described. The data conversion unit 21 has an image identification processing function, recognizes the page margin, paper size, paper orientation, ruled line, ruled line position, and ruled line format of the form 11, and performs a grouping process of areas based on the ruled line. There are various methods for grouping. Here, for example, the data conversion unit 21 groups an area surrounded by a ruled line into one area. The data conversion unit 21 generates a region process TB based on the grouping process.

FIG. 7 is a diagram illustrating an example of the area processing TB. For example, the data conversion unit 21 recognizes the areas of the form 11 shown in FIG. 5 and generates an area processing TB 42 as shown in FIG. 7 in the storage unit 28.

領域処理ＴＢ４２の「Ｎｏ．」の欄には、グルーピングした文字列に付与した番号が格納される。 The number assigned to the grouped character string is stored in the “No.” column of the area process TB42.

「文字列」の欄には、帳票１１のグルーピングした文字列が格納される。例えば、図５の例の場合、「申請名称」は、１つの罫線枠に囲まれており、図７の「Ｎｏ．１」の欄にグルーピングされている。また、図５の「理由：○○県視察のため」は、罫線枠外のまとまった領域に記載されているため、図７の「Ｎｏ．７」の欄にグルーピングされている。 In the “character string” column, the character strings grouped in the form 11 are stored. For example, in the case of the example in FIG. 5, the “application name” is surrounded by one ruled line frame and grouped in the column “No. 1” in FIG. Further, “Reason: For XX prefecture inspection” in FIG. 5 is grouped in the column of “No. 7” in FIG. 7 because it is described in a grouped area outside the ruled line frame.

The “start position” column stores the start position of the grouped area, and the “end position” column stores its end position. A grouped area is, for example, quadrangular, and the “start position” and “end position” are the two vertices on a diagonal of the quadrangle. In the example of FIG. 7, “Application name” is grouped as one group (area); the start position of the area is (Y, X) = (20, 10), and the end position is (X, Y) = (30, 40).
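A row of the area processing TB can be sketched as a rectangle defined by two diagonal corners. The class name `Region` and the `contains` helper are hypothetical, and the sketch assumes that both corners are stored in (y, x) order, which the description does not state consistently.

```python
from dataclasses import dataclass

@dataclass
class Region:
    no: int       # number assigned to the group ("No." column)
    text: str     # grouped character string
    start: tuple  # one corner of the area, assumed to be (y, x)
    end: tuple    # diagonally opposite corner, assumed to be (y, x)

    def contains(self, y, x):
        """Return True if the point (y, x) lies inside the area."""
        (y0, x0), (y1, x1) = self.start, self.end
        return (min(y0, y1) <= y <= max(y0, y1)
                and min(x0, x1) <= x <= max(x0, x1))

# Values taken from the "Application name" example; their order is an assumption.
region = Region(no=1, text="Application name", start=(20, 10), end=(30, 40))
```

A check such as `region.contains(25, 20)` could then be used to decide which area a recognized string belongs to.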

このように、データ変換部２１は、紙媒体の帳票１１をスキャンし、文字列処理ＴＢ４１および領域処理ＴＢ４２を生成する。 In this manner, the data conversion unit 21 scans the paper medium form 11 and generates the character string process TB 41 and the area process TB 42.

Note that the data conversion unit 21 converts a plurality of forms 11 of the same type into electronic data. For example, the data conversion unit 21 converts a plurality of forms 11 that have the same format as the form 11 shown in FIG. 5 (for example, the same items and ruled lines) but were filled in by different users with different item contents. The data conversion unit 21 then generates a character string processing TB 41 and an area processing TB 42 for each of the forms 11.

Returning to the description of FIG. 4.

［ステップＳ２］次に入力部２２は、データ変換される帳票１１に対応する補足データ１２を受付ける。入力部２２は、補足データ１２が入力された場合、データ変換部２１が生成した文字列処理ＴＢ４１に、入力された補足データ１２の情報を付加する。補足データ１２の情報が付加された文字列処理ＴＢを説明する前に、補足データ１２について説明する。 [Step S2] Next, the input unit 22 receives supplementary data 12 corresponding to the form 11 to be converted. When the supplementary data 12 is input, the input unit 22 adds the information of the input supplemental data 12 to the character string process TB 41 generated by the data conversion unit 21. Before explaining the character string processing TB to which the information of the supplementary data 12 is added, the supplementary data 12 will be explained.

図８は、補足データの一例を示した図である。図８に示す補足データ１２は、図５で説明した帳票１１の補足データ例を示している。 FIG. 8 is a diagram showing an example of supplementary data. Supplementary data 12 shown in FIG. 8 shows an example of supplementary data of the form 11 described in FIG.

The supplementary data 12 includes information on items and item contents. An item is described, for example, between the tags <項目名> and </項目名> (item name), and an item content is described, for example, between the tags <内容> and </内容> (content).

Specifically, “Application name” shown in FIG. 5 is an item, and “Application for ○○ Prefecture inspection transportation expenses” is its item content. Accordingly, in the supplementary data 12 for the example of FIG. 5, as shown in FIG. 8, “Application name” is described between <項目名> (item name) and </項目名>, and “Application for ○○ Prefecture inspection transportation expenses” is described between <内容> (content) and </内容>.

The supplementary data 12 only needs to allow items and item contents to be distinguished. The supplementary data 12 is therefore created as, for example, an XML file, a CSV file, or a text file.
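As an illustration of the XML case, the following sketch reads item and item-content pairs with Python's standard `xml.etree.ElementTree` module. Only the tag names <項目名> and <内容> come from the description; the wrapper elements `<data>` and `<entry>`, the function name, and the sample values are assumptions, because the exact structure of FIG. 8 is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical supplementary data; <data> and <entry> are assumed wrappers.
SAMPLE = """<data>
  <entry>
    <項目名>Application name</項目名>
    <内容>Application for inspection transportation expenses</内容>
  </entry>
</data>"""

def parse_supplementary(xml_text):
    """Return a list of (item, item content) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.iter("entry"):
        item = entry.findtext("項目名")   # fixed characters printed on the form
        content = entry.findtext("内容")  # characters entered by the user
        pairs.append((item, content))
    return pairs

pairs = parse_supplementary(SAMPLE)
```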

The supplementary data 12 is used to generate a higher-quality computerized form 13, form template 14, and template information 15. Accordingly, even if no supplementary data is input to the input unit 22, the information processing apparatus 1 can generate the computerized form 13, the form template 14, and the template information 15 of a form 11 from a plurality of forms 11 of the same type.

このような品質向上のための補足データ１２は、帳票１１の固定文字である項目と、ユーザが記入した記入文字である項目内容とを含んでいればよい。従って、補足データ１２は、例えば、帳票１１のレイアウトに依存した、項目の絶対位置・相対位置に関する情報や、システム環境に依存する文字種別といった情報を含まなくてよい。もちろん、補足データ１２は、生成する情報の品質向上のために、その他の付加情報を含んでいてもよい。 Such supplementary data 12 for improving quality may include items that are fixed characters of the form 11 and item contents that are written characters entered by the user. Therefore, the supplementary data 12 may not include information on the absolute position / relative position of items depending on the layout of the form 11 or information on the character type depending on the system environment. Of course, the supplementary data 12 may include other additional information in order to improve the quality of information to be generated.

補足データ１２の情報が付加された文字列処理ＴＢについて説明する。上記したように、入力部２２は、補足データ１２が入力された場合、データ変換部２１が生成した文字列処理ＴＢに補足データ１２の情報を付加する。 The character string process TB to which the supplementary data 12 information is added will be described. As described above, when the supplementary data 12 is input, the input unit 22 adds the information of the supplementary data 12 to the character string process TB generated by the data conversion unit 21.

図９は、補足データが付加された文字列処理ＴＢの一例を示した図である。図９に示すように、補足データ１２が付加された文字列処理ＴＢ４３は、「変換結果」の欄および「補足データ情報」の欄を有している。 FIG. 9 is a diagram showing an example of a character string process TB to which supplementary data is added. As shown in FIG. 9, the character string processing TB43 to which the supplementary data 12 is added has a “conversion result” column and a “supplemental data information” column.

「変換結果」の欄の情報は、データ変換部２１によって生成された文字列処理ＴＢの情報に対応する。例えば、図９の「変換結果」の欄の情報は、図６で説明した文字列処理ＴＢ４１の情報に対応している。 The information in the “conversion result” column corresponds to the information on the character string processing TB generated by the data conversion unit 21. For example, the information in the “conversion result” column of FIG. 9 corresponds to the information of the character string processing TB 41 described in FIG.

「補足データ情報」の欄には、入力部２２に入力された補足データ１２の情報が格納される。「補足データ情報」の欄は、「文字列」の欄、「項目名」の欄、「項目／内容区分」の欄、および「固定文字確率」の欄を有している。 In the “supplemental data information” field, information on the supplementary data 12 input to the input unit 22 is stored. The “supplemental data information” field includes a “character string” field, an “item name” field, an “item / content category” field, and a “fixed character probability” field.

「文字列」の欄には、補足データ１２の項目および項目内容の文字列が格納される。例えば、上記の図８の例では、項目の文字列は「申請名称」であり、項目内容の文字列は「○○県視察交通費申請」である。従って、図９の「補足データ情報」の欄の「文字列」の欄には、「申請名称」および「○○県視察交通費申請」が格納されている。なお、「補足データ情報」の欄の「文字列」の欄に格納される文字列は、「変換結果」の欄の「文字列」の欄の文字列に対応して格納される。 In the “character string” column, items of supplementary data 12 and character strings of item contents are stored. For example, in the example of FIG. 8 described above, the character string of the item is “application name”, and the character string of the item content is “XX prefecture inspection transportation fee application”. Therefore, in the “Character string” column of the “Supplementary data information” column in FIG. 9, “Application name” and “Application for transportation expenses for XX prefecture inspection” are stored. The character string stored in the “character string” field in the “supplemental data information” field is stored in correspondence with the character string in the “character string” field in the “conversion result” field.

The “item name” column stores the item name associated with each item and item content. For example, the item name of the item “Application name” is “Application name”. The item name of the item content “Application for ○○ Prefecture inspection transportation expenses” is also “Application name”.

「項目／内容区分」の欄には、「補足データ情報」の欄の「文字列」の欄に格納された文字列が、項目であるか、項目内容であるかの情報が格納される。例えば、「申請名称」は、図８に示した補足データ１２の例より、項目であるので、その「項目／内容区分」の欄には、項目を示す「項目」が格納される。また、「○○県視察交通費申請」は、図８に示した補足データ１２の例より、項目内容であるので、その「項目／内容区分」の欄には、項目内容を示す「内容」が格納される。 In the “item / content category” column, information indicating whether the character string stored in the “character string” column in the “supplemental data information” column is an item or item content is stored. For example, the “application name” is an item from the example of the supplementary data 12 shown in FIG. 8, and therefore, an “item” indicating an item is stored in the “item / content category” column. In addition, “XX prefecture inspection transportation fee application” is an item content from the example of the supplementary data 12 shown in FIG. 8, and therefore, in the “item / content category” column, “content” indicating the item content is displayed. Is stored.

「固定文字確率」の欄には、「補足データ情報」の欄の「文字列」の欄に記載された文字列が、固定文字である確率が格納される。例えば、「申請名称」は、項目であるので、固定文字である。従って、文字列「申請名称」に対応する「固定文字確率」の欄には、固定文字率が１００％であることを示す「１００％」という情報が格納される。また、「○○県視察交通費申請」は、項目内容であるので、ユーザが記入した記入文字である。従って、文字列「○○県視察交通費申請」に対応する「固定文字確率」の欄には、固定文字率が０％であることを示す「０％」という情報が格納される。 In the “fixed character probability” column, the probability that the character string described in the “character string” column in the “supplemental data information” column is a fixed character is stored. For example, since “application name” is an item, it is a fixed character. Therefore, the information “100%” indicating that the fixed character rate is 100% is stored in the “fixed character probability” column corresponding to the character string “application name”. In addition, “XX prefecture inspection transportation fee application” is an entry character, because it is an item content. Therefore, the information “0%” indicating that the fixed character rate is 0% is stored in the “fixed character probability” column corresponding to the character string “application for XX prefecture inspection and transportation expenses”.

このように、補足データ１２入力された場合、データ変換部２１によってデータ変換された帳票１１の文字列が、固定文字であるか記入文字であるか、正確に分かる。 As described above, when the supplementary data 12 is input, it is possible to accurately know whether the character string of the form 11 converted by the data converter 21 is a fixed character or an input character.

Returning to the description of FIG. 4. Steps S1 and S2 described above may be performed in the reverse order or at the same time.

［ステップＳ３］次に文字補正部２３は、データ変換部２１による文字列の誤変換を検出し、補正を行う。文字補正部２３は、次の２つの方法によって、誤変換された文字列の検出および補正を行うことができる。文字補正部２３は、方法１だけで誤変換された文字列の検出および補正を行ってもよいが、補足データ１２が入力された場合には、方法２によって、誤変換された文字列の検出および補正を行ってもよい。また、文字補正部２３は、補足データ１２が入力された場合には、方法１と方法２によって、誤変換された文字列の検出および補正を行ってもよい。 [Step S3] Next, the character correction unit 23 detects an erroneous conversion of the character string by the data conversion unit 21 and corrects it. The character correction unit 23 can detect and correct the erroneously converted character string by the following two methods. The character correction unit 23 may detect and correct a character string erroneously converted only by the method 1. However, when the supplementary data 12 is input, the character correction unit 23 detects the erroneously converted character string by the method 2. And correction may be performed. In addition, when the supplementary data 12 is input, the character correction unit 23 may detect and correct the erroneously converted character string by the method 1 and the method 2.

１．複数の帳票による補正 1. Correction by multiple forms

文字補正部２３は、データ変換部２１によって生成された複数の帳票１１の文字列処理ＴＢ４１を参照し、データ変換部２１の誤変換を補正する。 The character correction unit 23 refers to the character string processing TB 41 of the plurality of forms 11 generated by the data conversion unit 21 and corrects erroneous conversion of the data conversion unit 21.

例えば、データ変換部２１は、上記したように、複数の同種の帳票１１をデータ変換し、図６に示したような、文字列処理ＴＢ４１を複数生成する。文字補正部２３は、複数生成された文字列処理ＴＢ４１を比較し、例えば、確率論やルール手法に基づいた方法で、最も正しいと思われる文字列に補正する。 For example, as described above, the data conversion unit 21 converts a plurality of the same type of forms 11 and generates a plurality of character string processes TB41 as shown in FIG. The character correction unit 23 compares a plurality of generated character string processes TB41 and corrects them to a character string that seems to be the most correct, for example, by a method based on probability theory or rule method.

Specifically, consider a case where six character string processing TBs 41, A through F, have been generated. Suppose that the character string in A, D, and F is “date”, that the character string in B and C is “dete”, and that the character string in E is “dote”. In this case, “date” has the highest appearance probability, 3/6. The character correction unit 23 therefore corrects the strings “dete” and “dote” to “date”.
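The frequency-based correction in this example can be sketched as a simple majority vote. The function name `majority_reading` is hypothetical, and the description does not say how ties are handled; this sketch returns whichever of the tied readings was seen first.

```python
from collections import Counter

def majority_reading(readings):
    """Return the reading that appears most often across the forms."""
    counts = Counter(readings)
    best, _ = counts.most_common(1)[0]
    return best

# Readings of the same field from tables A to F, as in the example above.
readings = ["date", "dete", "dete", "date", "dote", "date"]
corrected = [majority_reading(readings)] * len(readings)
```

Here “date” appears in three of the six tables, so every reading is replaced with “date”.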

または、文字補正部２３は、データ変換部２１のＯＣＲ機能による文字識別の確度を集計して補正してもよい。例えば、Ａにおける「ｄａｔｅ」の確度が９０％、Ｂにおける「ｄｅｔｅ」の確度が７０％である場合、「ｄａｔｅ」を正しい文字として補正するようにしてもよい。 Alternatively, the character correction unit 23 may collect and correct the accuracy of character identification by the OCR function of the data conversion unit 21. For example, when the accuracy of “date” in A is 90% and the accuracy of “dete” in B is 70%, “date” may be corrected as a correct character.
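One possible reading of "aggregating" the OCR confidence is to sum the confidence of each candidate string over all forms and pick the candidate with the largest total. That choice, the function name `reading_by_confidence`, and the representation of confidence as a fraction between 0 and 1 are assumptions for illustration only.

```python
from collections import defaultdict

def reading_by_confidence(readings):
    """Pick the reading with the largest summed OCR confidence.

    readings: list of (text, confidence) pairs, one per form.
    """
    totals = defaultdict(float)
    for text, confidence in readings:
        totals[text] += confidence
    return max(totals, key=totals.get)

# The example above: "date" at 90% in A and "dete" at 70% in B.
chosen = reading_by_confidence([("date", 0.9), ("dete", 0.7)])
```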

２．補足データによる補正 2. Correction by supplementary data

文字補正部２３は、補足データ１２が入力された場合、入力された補足データ１２を用いてデータ変換部２１の誤変換を補正することができる。 When the supplementary data 12 is input, the character correction unit 23 can correct erroneous conversion of the data conversion unit 21 using the input supplemental data 12.

例えば、補足データ１２は、帳票１１の文字列が固定文字であるか、記入文字であるかをユーザが直接指定したものである。従って、文字補正部２３は、補足データ１２とデータ変換された帳票１１の文字列を比較すれば、誤変換された帳票１１の文字列を検出して補正することができる。 For example, in the supplementary data 12, the user directly specifies whether the character string of the form 11 is a fixed character or an entry character. Therefore, the character correction unit 23 can detect and correct the character string of the form 11 that has been erroneously converted by comparing the supplementary data 12 and the character string of the form 11 that has been converted.

Specifically, the character correction unit 23 refers to the character string processing TB 43 described with FIG. 9 (the character string processing TB to which the supplementary data 12 has been added) to detect and correct erroneous conversions by the data conversion unit 21. For example, suppose that, as shown in the “character string” column under “conversion result” in FIG. 9, “申請名称” (application name) has been misrecognized as “甲請名称”. The character correction unit 23 can detect the error by comparing it with “申請名称” in the “character string” column under “supplemental data information” of the character string processing TB 43, and can correct the character “甲” to the correct character “申”.
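The comparison against the supplementary data can be sketched with approximate string matching. The description does not specify how the OCR result is paired with the supplementary string; this sketch uses `difflib.get_close_matches` from the Python standard library as a stand-in, and the function name and the cutoff of 0.6 are assumptions.

```python
import difflib

def correct_with_supplement(ocr_text, known_strings, cutoff=0.6):
    """Replace an OCR result with the closest string from the supplementary data.

    If no known string is similar enough, the OCR result is returned unchanged.
    """
    matches = difflib.get_close_matches(ocr_text, known_strings, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text

# Strings taken from the supplementary data (item and item content).
known = ["申請名称", "○○県視察交通費申請"]
fixed = correct_with_supplement("甲請名称", known)
```

In this example three of the four characters agree, so the misrecognized string is replaced with “申請名称”.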

Returning to the description of FIG. 4.

[Step S4] Next, the character identification unit 24 distinguishes the fixed characters printed in advance on the form 11 from the entry characters written on the form 11 by the user. The character identification unit 24 uses the following three methods to determine whether a string is a fixed character or an entry character.

１．複数の帳票データによる識別 1. Identification with multiple forms

In general, in forms 11 of the same type, fixed characters such as headings and notes appear as the same character string at the same position on every form 11. In contrast, the entry characters written by users may differ from one form 11 to another, even among forms 11 of the same type.

そこで、文字識別部２４は、複数の帳票１１間での文字列の差異を判定し、差異が生じていないものを固定文字と識別し、差異が生じているものを記入文字と識別する。なお、データ変換部２１によってデータ変換される帳票の数が多ければ多いほど、文字識別部２４の識別精度は高くなる。これは、データ変換された帳票１１の数が少ない場合、ユーザの記入内容が偶然同じ文字列になる場合もあるからである。 Therefore, the character identification unit 24 determines the difference between the character strings among the plurality of forms 11, identifies those that have no difference as fixed characters, and identifies those that have a difference as entry characters. Note that the greater the number of forms whose data is converted by the data conversion unit 21, the higher the identification accuracy of the character identification unit 24. This is because if the number of forms 11 that have undergone data conversion is small, the user entry may happen to be the same character string.

図１０は、複数の帳票データによる文字識別を説明する図である。図１０には、文字列処理ＴＢ４４ａ，４４ｂ，４４ｃが示してある。また、図１０には、固定文字および記入文字を識別する際に生成される文字列比較ＴＢ４５ａ，４５ｂが示してある。また、図１０には、文字列比較ＴＢ４５ａ，４５ｂに基づいて生成される、固定文字および記入文字の識別結果を示した文字列処理ＴＢ４６が示してある。 FIG. 10 is a diagram for explaining character identification using a plurality of form data. FIG. 10 shows character string processing TBs 44a, 44b, and 44c. FIG. 10 shows character string comparison TBs 45a and 45b generated when identifying fixed characters and entry characters. FIG. 10 shows a character string process TB46 showing the identification result of the fixed character and the entry character generated based on the character string comparison TBs 45a and 45b.

Suppose, for example, that the data conversion unit 21 converts three forms 11, A, B, and C, and generates the character string processing TBs 44a, 44b, and 44c. As shown in FIG. 10, the “No. 1” column of the character string processing TB 44a stores the character string “Application name”, and the “No. 2” column stores the character string “Travel expense application”.

文字列処理ＴＢ４４ｂの「Ｎｏ．１」の欄には、「申請名称」の文字列が格納され、「Ｎｏ．２」の欄には、「備品購入申請」の文字列が格納されている。 In the “No. 1” column of the character string processing TB 44 b, the “application name” character string is stored, and in the “No. 2” column, the “equipment purchase application” character string is stored.

文字列処理ＴＢ４４ｃの「Ｎｏ．１」の欄には、「申請名称」の文字列が格納され、「Ｎｏ．２」の欄には、「備品購入申請」の文字列が格納されている。 In the “No. 1” column of the character string processing TB 44 c, the “application name” character string is stored, and in the “No. 2” column, the “equipment purchase application” character string is stored.

文字識別部２４は、文字列処理ＴＢ４４ａ〜４４ｃの「Ｎｏ．」ごとに文字列とその位置とを収集する。そして、文字識別部２４は、文字列比較ＴＢ４５ａ，４５ｂを生成する。 The character identification unit 24 collects a character string and its position for each “No.” in the character string processing TBs 44a to 44c. Then, the character identifying unit 24 generates character string comparison TBs 45a and 45b.

例えば、文字識別部２４は、文字列処理ＴＢ４４ａ〜４４ｃの「Ｎｏ．１」の欄の文字列「申請名称」と、その位置とを取得し、文字列比較ＴＢ４５ａを生成する。ここで、文字列処理ＴＢ４４ａ〜４４ｃの「Ｎｏ．１」の欄の文字列「申請名称」は、文字列処理ＴＢ４４ａ〜４４ｃ間で差異がない。つまり、「申請名称」という文字列は、Ａ〜Ｃの複数の帳票１１間で差異がない。従って、文字識別部２４は、「申請名称」という文字列を固定文字と識別でき、文字列比較ＴＢ４５ａの判定結果の欄に「固定文字」を格納する。 For example, the character identifying unit 24 acquires the character string “application name” in the “No. 1” field of the character string processing TBs 44 a to 44 c and the position thereof, and generates a character string comparison TB 45 a. Here, the character string “application name” in the “No. 1” column of the character string processes TB44a to 44c is not different between the character string processes TB44a to 44c. That is, the character string “application name” has no difference among the plurality of forms 11 of A to C. Therefore, the character identification unit 24 can identify the character string “application name” as a fixed character, and stores “fixed character” in the determination result column of the character string comparison TB 45a.

また、文字識別部２４は、文字列処理ＴＢ４４ａ〜４４ｃの「Ｎｏ．２」の欄の文字列と、その位置とを取得し、文字列比較ＴＢ４５ｂを生成する。ここで、文字列処理ＴＢ４４ａ〜４４ｃの「Ｎｏ．２」の欄の文字列は、文字列処理ＴＢ４４ａ〜４４ｃ間で差異がある。つまり、「Ｎｏ．２」の欄の文字列は、Ａ〜Ｃの複数の帳票１１間で差異がある。従って、文字識別部２４は、「旅費申請」および「備品購入申請」という文字列を記入文字と識別でき、文字列比較ＴＢ４５ｂの判定結果の欄に「記入文字」を格納する。 In addition, the character identification unit 24 acquires the character string in the “No. 2” column of the character string processing TBs 44a to 44c and the position thereof, and generates a character string comparison TB 45b. Here, the character string in the column “No. 2” of the character string processing TBs 44a to 44c is different between the character string processing TBs 44a to 44c. That is, the character string in the column “No. 2” is different among the plurality of forms 11 of A to C. Therefore, the character identification unit 24 can identify the character strings “travel application” and “equipment purchase application” as input characters, and stores “entry characters” in the determination result column of the character string comparison TB 45 b.
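The comparison described for the character string comparison TBs 45a and 45b can be sketched as follows. Each table is reduced to a dictionary from "No." to the recognized string; the function name `classify_strings` and this simplified table format are assumptions, and position information is omitted.

```python
def classify_strings(tables):
    """Classify each "No." as fixed or entry characters.

    tables: one dict per form, mapping "No." to the recognized string.
    A string is "fixed" if it is identical on every form, otherwise "entry".
    """
    shared_numbers = sorted(set.intersection(*(set(t) for t in tables)))
    result = {}
    for no in shared_numbers:
        distinct = {t[no] for t in tables}
        result[no] = "fixed" if len(distinct) == 1 else "entry"
    return result

# Simplified contents of the tables 44a, 44b, and 44c.
tb_44a = {1: "Application name", 2: "Travel expense application"}
tb_44b = {1: "Application name", 2: "Equipment purchase application"}
tb_44c = {1: "Application name", 2: "Equipment purchase application"}
judgement = classify_strings([tb_44a, tb_44b, tb_44c])
```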

Once the character identification unit 24 has generated the character string comparison TBs 45a and 45b for each character string stored under the “No.” field of the character string processing TBs 44a to 44c, it uses them to generate a character string processing TB 46. For example, the character identification unit 24 generates a character string processing TB 46 that indicates, for the character string in each area, the probability that it consists of fixed characters.

For example, the “character string” field of the character string processing TB 46 shown in FIG. 10 stores the character string of each area of the form 11. The “position” field stores the position at which the character string appears on the form 11. The “determination result” field stores the probability that the character string stored in the “character string” field consists of fixed characters.

Note that erroneous conversion by the data conversion unit 21, or variation in wording from one form 11 to another, may cause a character string to differ between the forms 11 even though it consists of fixed characters. In this case, the character identification unit 24 may evaluate the similarity of character shapes or word meanings numerically and apply a threshold so that near matches are recognized as the same character string.

Character strings in the plurality of forms 11 must be compared within the same item. The same item can be located, for example, by the position of the character string on each form 11. Small shifts and errors in the position of a character string between the forms 11 are a concern, but they can be handled by setting a threshold on the allowable error range.

As another way of identifying what to compare, the character identification unit 24 may recognize the character strings to be compared from structural features of the form 11, such as ruled lines.
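The comparison across forms described in method 1 can be sketched as follows. This is only an illustration under assumptions: the function names, the position tolerance of 5, the similarity threshold of 0.8, the use of `difflib.SequenceMatcher` as the similarity measure, and the way the probability is derived (share of forms agreeing with the most frequent string) are not taken from the specification, which does not state how the values in the character string processing TB 46 are computed.

```python
from collections import Counter
from difflib import SequenceMatcher

def same_item(pos_a, pos_b, tol=5):
    """True if two (y, x) positions differ by at most tol on each axis."""
    return abs(pos_a[0] - pos_b[0]) <= tol and abs(pos_a[1] - pos_b[1]) <= tol

def similar(a, b, threshold=0.8):
    """True if two strings are close enough to count as the same string."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def fixed_probability(strings, threshold=0.8):
    """Share of forms whose string is similar to the most frequent string (assumed rule)."""
    most_common = Counter(strings).most_common(1)[0][0]
    agree = sum(1 for s in strings if similar(s, most_common, threshold))
    return agree / len(strings)

def compare_forms(forms, tol=5):
    """forms: one list of (text, (y, x)) per form; the first form is the reference."""
    results = []
    for text, pos in forms[0]:
        column = [text]
        for other in forms[1:]:
            # Pick the string of the other form that sits at (almost) the same position.
            match = next((t for t, p in other if same_item(pos, p, tol)), None)
            if match is not None:
                column.append(match)
        results.append((text, pos, fixed_probability(column)))
    return results

# Hypothetical data: form C contains an OCR error ("Applicatlon").
form_a = [("Application name", (10, 10)), ("Travel expense application", (10, 60))]
form_b = [("Application name", (11, 9)), ("Equipment purchase application", (10, 61))]
form_c = [("Applicatlon name", (10, 12)), ("Travel expense application", (9, 60))]
table = compare_forms([form_a, form_b, form_c])
```

With these inputs, the first item is treated as identical on all three forms despite the OCR error and the small position shifts, while the second item differs on one form and therefore receives a lower probability.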

2. Identification by learning

Fixed characters and entered characters differ characteristically in the words used, the number of characters, the size of the area in which they are written, and so on. The character identification unit 24 therefore accumulates information for evaluating whether a character string consists of fixed characters or entered characters, and uses it for identification.

For example, the character identification unit 24 generates a learning information TB for fixed characters from the determination results of “1. Identification by a plurality of form data” described above.

FIG. 11 is a diagram illustrating an example of the learning information TB. As shown in FIG. 11, the “No.” field of the learning information TB 47 stores the number assigned to each learned character string.

The “character string” field stores the learned character string, for example a character string for which “1. Identification by a plurality of form data” described above has determined whether or not it consists of fixed characters. FIG. 11 shows, for example, that “Application name” and “Applicant” have been judged in this way in the past.

The “fixed character probability” field stores the probability that the character string stored in the “character string” field consists of fixed characters. For example, the character string “Application name” has been identified as fixed characters with a probability of 91% in the past.

The character identification unit 24 refers to the generated learning information TB 47 and acquires the probability that a character string converted by the data conversion unit 21 consists of fixed characters.

FIG. 12 is a diagram illustrating an example of a character string processing TB based on the learning information TB. The character identification unit 24 refers to the learning information TB 47 described with reference to FIG. 11 and generates a learning-based character string processing TB 48.

The information in the “No.”, “character string”, and “position” fields of the character string processing TB 48 is stored by the data conversion unit 21, in the same way as for the character string processing TB 41 described with reference to FIG. 6.

The “learning result fixed character probability” field stores the probability that the character string in the corresponding “character string” field consists of fixed characters.

For example, the character identification unit 24 refers to the learning information TB 47 described with reference to FIG. 11 and acquires the probability that the character string “Application name” consists of fixed characters. In the example of FIG. 11, this probability is 91%. The character identification unit 24 stores the acquired probability in the “learning result fixed character probability” field, as shown in FIG. 12.

In this way, the character identification unit 24 identifies whether a character string consists of fixed characters or entered characters on the basis of what it has learned from past character strings.
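One way to picture the learning information TB is as a running tally per character string. The sketch below assumes that the stored probability is simply the fraction of past judgments in which the string was identified as fixed characters; the class and method names are hypothetical, and the specification does not state how the 91% figure is actually accumulated.

```python
class LearningTable:
    """Accumulates how often each string was judged to be fixed characters."""

    def __init__(self):
        self._counts = {}  # text -> (times judged fixed, times seen)

    def record(self, text, is_fixed):
        """Add one past judgment from method 1 for the given string."""
        fixed, seen = self._counts.get(text, (0, 0))
        self._counts[text] = (fixed + int(is_fixed), seen + 1)

    def fixed_probability(self, text):
        """Return the learned probability, or None for a string never seen."""
        if text not in self._counts:
            return None
        fixed, seen = self._counts[text]
        return fixed / seen

# Hypothetical history: judged fixed 10 times out of 11.
learning = LearningTable()
for judged_fixed in [True] * 10 + [False]:
    learning.record("Application name", judged_fixed)
```

Returning `None` for an unseen string leaves it to the caller to decide how a missing learning result should enter the final determination.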

3. Identification by supplementary data

When supplementary data 12 is input, the character identification unit 24 uses the supplementary data 12 to identify whether a character string consists of fixed characters or entered characters.

The supplementary data 12 is, for example, data in which the user directly specifies whether each character string of the form 11 consists of fixed characters or entered characters. Comparing the supplementary data 12 with the data-converted character strings of the form 11 therefore makes it possible to identify whether a data-converted character string consists of fixed characters or entered characters.

Specifically, when a data-converted character string matches an item (fixed characters) described in the supplementary data 12, the character identification unit 24 identifies that character string as 100% fixed characters. When a data-converted character string matches an item content (entered characters) described in the supplementary data 12, the character identification unit 24 identifies that character string as 0% fixed characters.
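The matching rule for supplementary data reduces to two lookups. In the sketch below, the supplementary data is assumed to have already been parsed into a set of items and a set of item contents; the function name and the `None` result for strings that match neither set are assumptions made for illustration.

```python
def supplementary_probability(text, items, contents):
    """Fixed character probability derived from user-supplied supplementary data."""
    if text in items:
        return 1.0   # matches an item: 100% fixed characters
    if text in contents:
        return 0.0   # matches an item content: 0% fixed characters
    return None      # not mentioned in the supplementary data (assumed handling)

# Hypothetical parsed supplementary data.
items = {"Application name", "Applicant"}
contents = {"Travel expense application"}
```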

The character identification unit 24 identifies fixed characters and entered characters by each of the three methods above, and combines their results to make the final identification. For example, the character identification unit 24 averages the fixed character probabilities identified (calculated) by the three methods.

FIG. 13 is a diagram illustrating an example of the character string processing TB. The character identification unit 24 combines the identification results of the three methods above to generate the character string processing TB 49 shown in FIG. 13.

The “No.”, “character string”, and “position” fields of the character string processing TB 49 store the data converted by the data conversion unit 21.

The “fixed character probability” field has a “multiple images” field, a “supplementary data” field, a “learning” field, and an “average” field.

The “multiple images” field stores the fixed character probability calculated in “1. Identification by a plurality of form data” above. For example, as described with reference to FIG. 10, the character identification unit 24 generates the character string processing TB 46 from the plurality of forms 11, and stores the “determination result” of the generated character string processing TB 46 in the “multiple images” field. Note that the TB contents of FIG. 10 and FIG. 13 do not match.

The “supplementary data” field stores the fixed character probability calculated in “3. Identification by supplementary data” above. For example, when the character string “Application name” is described as an item in the supplementary data 12, the character identification unit 24 identifies “Application name” as 100% fixed characters and stores the fixed character probability of 100% in the “supplementary data” field. Likewise, when the character string “○○ Prefecture inspection travel expense application” is described as an item content in the supplementary data 12, the character identification unit 24 identifies that character string as 0% fixed characters and stores the fixed character probability of 0% in the “supplementary data” field.

The “learning” field stores the fixed character probability calculated in “2. Identification by learning” above. For example, as described with reference to FIGS. 11 and 12, the character identification unit 24 generates the character string processing TB 48 from the learning information TB 47, and stores the “learning result fixed character probability” of the generated character string processing TB 48 in the “learning” field.

The character identification unit 24 calculates the average of the fixed character probabilities stored in the “multiple images”, “supplementary data”, and “learning” fields, and stores it in the “final result” field. For example, for the character string “Application name”, the fixed character probability from multiple images is 100%, that from the supplementary data is 100%, and that from learning is 91%, so 97% is stored in the “final result” field.

The “determination result” field stores the final result of whether or not the character string stored in the “character string” field consists of fixed characters. The character identification unit 24 makes this final determination, for example, by comparing the probability stored in the “final result” field with a predetermined threshold. For example, when the probability stored in the “average” field is 80% or more, the character identification unit 24 determines that the character string consists of fixed characters.

In the above description, the character identification unit 24 makes the final recognition (determination) of fixed characters on the basis of the results of all three methods. However, it may make the final recognition on the basis of the results of methods 1 and 2 only, or on the basis of the results of methods 1 and 3 only.

The character identification unit 24 may also weight the results of the three methods when determining fixed characters. For example, the supplementary data 12 is created by the user and may contain errors. In that case, the weight given to the identification result based on the supplementary data 12 may be reduced.
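The combination step can be sketched as a weighted average followed by a threshold test. The example reproduces the figures given above for “Application name” (100%, 100%, and 91%, averaging to 97%, with an 80% threshold). The function names, the weight values, and the convention of passing `None` for a method that was not used are assumptions made for illustration.

```python
def combine(probabilities, weights=None):
    """Weighted average of the available probabilities; None entries are skipped."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    pairs = [(p, w) for p, w in zip(probabilities, weights) if p is not None]
    total = sum(w for _, w in pairs)
    if total == 0:
        raise ValueError("no usable identification result")
    return sum(p * w for p, w in pairs) / total

def judge(probability, threshold=0.8):
    """Final determination against the threshold (80% in the example above)."""
    return "fixed characters" if probability >= threshold else "entered characters"

# Order: multiple images, supplementary data, learning.
average = combine([1.0, 1.0, 0.91])
# Methods 1 and 2 only (no supplementary data was input).
without_supplement = combine([1.0, None, 0.91])
# Supplementary data trusted half as much as the other methods (hypothetical weight).
weighted = combine([1.0, 0.0, 0.91], weights=[1.0, 0.5, 1.0])
```

Skipping `None` entries covers the variants in which only two of the three methods contribute, and lowering one weight covers the case in which user-created supplementary data is trusted less.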

Returning to the description of FIG. 4.

[Step S5] Next, on the basis of the fixed characters and entered characters identified by the character identification unit 24, the area identification unit 25 identifies the entry areas in which a user can enter characters on the form 11 and the non-entry areas in which a user cannot enter characters on the form 11. To do so, the area identification unit 25 generates the determination process TB described below.

FIG. 14 is a diagram illustrating an example of the determination process TB. The area identification unit 25 combines the character string processing TB 41 described with reference to FIG. 6 and the area processing TB 42 described with reference to FIG. 7 to generate the determination process TB 50 shown in FIG. 14. The area identification unit 25 also stores the determination results of the character string processing TB 49 described with reference to FIG. 13 in the generated determination process TB 50.

For example, the “area processing table” field of the determination process TB 50 shown in FIG. 14 corresponds to the area processing TB 42 shown in FIG. 7. The “No.” and “character string” fields under the “processing table” field of the determination process TB 50 correspond, for example, to the “No.” and “character string” fields of the character string processing TB 41 shown in FIG. 6.

Information is written to the “determination result” field of the determination process TB 50 on the basis of the “determination result” field of the character string processing TB 49 described with reference to FIG. 13.

For example, when the “determination result” of the character string processing TB 49 is “fixed characters”, “non-entry” is stored in the “determination result” field. When the “determination result” of the character string processing TB 49 is “entered characters”, “entry” is stored in the “determination result” field. Specifically, the character string “Application name” in FIG. 14 has been determined to consist of fixed characters according to the character string processing TB 49 of FIG. 13, so “non-entry” is stored in the “determination result” field corresponding to “Application name” in FIG. 14.

Note that “non-entry” in the “determination result” field indicates that the area of the corresponding character string is a fixed character area, and “entry” indicates that the area of the corresponding character string is an entered character area.

“Non-entry/entry” in the “determination result” field indicates that the area of the corresponding character string is both a fixed character area and an entered character area, that is, both a non-entry area and an entry area. For example, the area of the character string “Reason: for inspection of ○○ Prefecture” shown in FIG. 14 is both an entry area and a non-entry area (“Reason:” consists of fixed characters, and “for inspection of ○○ Prefecture” consists of entered characters).

Entry areas and non-entry areas become mixed because, during grouping by the data conversion unit 21, fixed characters and entered characters may be recognized as a single group. For example, in FIG. 5 the character string “Reason: for inspection of ○○ Prefecture” is written in a blank area outside the ruled-line frame, and the data conversion unit 21 may group the blank area outside the ruled-line frame as a single group.

The “division necessity” field stores information on whether or not a grouping that contains both an entry area and a non-entry area is to be divided. For example, the area of “Reason: for inspection of ○○ Prefecture” is both an entry area and a non-entry area, and is grouped as a single group. In this case, the area identification unit 25 stores “required” in the “division necessity” field to indicate that the grouping needs to be divided.
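The mapping from character determinations to area determinations can be sketched as below. The sketch assumes each group carries the per-string results of the character identification (“fixed” or “entered”); the function name and return shape are hypothetical.

```python
def judge_area(part_results):
    """Map the character results of one group to (area result, division required)."""
    kinds = set(part_results)
    if not kinds:
        raise ValueError("group contains no character strings")
    if kinds == {"fixed"}:
        return ("non-entry", False)
    if kinds == {"entered"}:
        return ("entry", False)
    # Fixed and entered characters were grouped together, so the group must be divided.
    return ("non-entry/entry", True)

# "Reason:" (fixed) and "for inspection of ... Prefecture" (entered) in one group.
mixed = judge_area(["fixed", "entered"])
```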

Returning to the description of FIG. 4.

[Step S6] Next, the grouping correction unit 26 corrects the grouping performed by the data conversion unit 21. The grouping correction unit 26 performs grouping correction using all or any of the following methods 1 to 3.

1. Correction using the determination process TB 50

The grouping correction unit 26 performs grouping correction with reference to the determination process TB 50 described with reference to FIG. 14.

FIG. 15 is a diagram illustrating an example of the determination process TB after grouping correction. The determination process TB 51 shown in FIG. 15 is substantially the same as the determination process TB 50 described with reference to FIG. 14, except that the determination process TB 51 does not have the “division necessity” field.

The grouping correction unit 26 refers to the “division necessity” field of the determination process TB 50 described with reference to FIG. 14. When “required” is stored in the “division necessity” field, the grouping correction unit 26 corrects the grouping so that the character strings corresponding to that field fall into separate groups.

For example, in the determination process TB 50 of FIG. 14, the division necessity of the character strings “Reason:” and “for inspection of ○○ Prefecture” is “required”. The grouping correction unit 26 therefore corrects the grouping so that these character strings fall into separate groups. Specifically, as shown in the determination process TB 51 of FIG. 15, the grouping is corrected so that the character string “Reason:” and the character string “for inspection of ○○ Prefecture” are in separate groups. The grouping correction unit 26 then corrects the “determination result” field so that it holds the appropriate information.
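A minimal sketch of this division is shown below. It assumes each row of the determination process TB is a dictionary holding its strings with their character results and a division flag; the field names are hypothetical, and the recalculation of start and end positions for the divided areas is omitted.

```python
def split_groups(rows):
    """Divide rows flagged for division into one row per character string."""
    corrected = []
    for row in rows:
        if row["split"]:
            for text, kind in row["strings"]:
                result = "non-entry" if kind == "fixed" else "entry"
                corrected.append({"strings": [(text, kind)], "result": result})
        else:
            corrected.append({"strings": row["strings"], "result": row["result"]})
    return corrected

# Hypothetical rows corresponding to the example above.
rows = [
    {"strings": [("Application name", "fixed")], "result": "non-entry", "split": False},
    {"strings": [("Reason:", "fixed"), ("for inspection of ○○ Prefecture", "entered")],
     "result": "non-entry/entry", "split": True},
]
corrected = split_groups(rows)
```

The corrected rows no longer carry the division flag, which mirrors the absence of the “division necessity” field after correction.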

2. Correction using multiple images

In an area that has no ruled lines and contains a variable-length character string, it is difficult to determine the end position of the area appropriately. However, the grouping correction unit 26 can detect an appropriate end position by using the area processing TBs 42 of the plurality of forms 11.

FIG. 16 is a diagram illustrating a variable-length area. The form 52 shown in FIG. 16 is substantially the same as the form 11 shown in FIG. 5, but the character string written in the “Reason:” field outside the ruled-line frame is different. For example, the character string in the “Reason:” field is one line in the form 11 of FIG. 5, but three lines in the form 52 of FIG. 16.

“Reason:” on the form 52 consists of fixed characters. The three lines of text giving the reason are entered characters written by the user. Properly, the entered character area in which the user can write should be recognized with point 52a as the start position and point 52b as the end position. In the example of the form 52, however, it is recognized with point 52a as the start position and point 52c as the end position.

However, the data conversion unit 21 converts a plurality of forms 11 and generates a plurality of area processing TBs 42. Entry data from various users is therefore available, and some users may have written text all the way to point 52b. Accordingly, by adopting the end position with the largest value, the grouping correction unit 26 can obtain an appropriate end position for the variable-length area. In other words, converting a plurality of forms makes highly accurate area determination possible.

FIG. 17 is a diagram illustrating an example of the determination process TB for the form of FIG. 16. Assume that the form 52 of FIG. 16 has undergone data conversion, character identification, area identification, and grouping correction, and that the determination process TB 53 shown in FIG. 17 has been obtained.

The end position of “No. 8” in the determination process TB 53 is (Y, X) = (90, 120). This end position has a larger Y value than the end position in the determination process TB 51 (FIG. 15) for the form 11 shown in FIG. 5. The grouping correction unit 26 can therefore correct, for example, the “Y” value of the end position of “No. 8” in the determination process TB 51 from “70” to “90”.
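Adopting the largest end position across forms can be sketched as follows. The sketch assumes “largest” is taken independently for Y and X; the function name is hypothetical, and the X value of 120 for the determination process TB 51 is assumed for illustration, since the text gives only its Y value of 70.

```python
def widest_end(end_positions):
    """Largest (Y, X) end position observed for the same area across forms."""
    max_y = max(y for y, _ in end_positions)
    max_x = max(x for _, x in end_positions)
    return (max_y, max_x)

# End positions of "No. 8": TB 51 (X assumed) and TB 53.
corrected_end = widest_end([(70, 120), (90, 120)])
```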

3. Correction using supplementary data

When supplementary data 12 is input, the grouping correction unit 26 performs grouping correction using the input supplementary data 12.

The supplementary data 12 is data in which the user specifies the fixed characters and entered characters of the data-converted form 11. By referring to the input supplementary data 12, the grouping correction unit 26 can therefore recognize which character strings should be grouped into a single group.

FIG. 18 is a diagram illustrating an example of the determination process TB for the form of FIG. 16 when it has been grouped incorrectly. For example, assume that the form 52 shown in FIG. 16 has undergone data conversion, character identification, and area identification, and that the determination process TB 54 shown in FIG. 18 has been obtained.

Properly, the character strings “For participation in the training below”, “・Patent creation training”, and “・Marketing theory” should be grouped into a single group. Suppose, however, that the data conversion unit 21 has grouped each of these character strings as a separate group, as shown in the determination process TB 54.

Here, the supplementary data 12 is created by the user. In the example of the form 52 of FIG. 16, the character strings “For participation in the training below”, “・Patent creation training”, and “・Marketing theory” are written in a single entry area, and are described with a tag such as the following.

<Content>For participation in the training below\n・Patent creation training\n・Marketing theory</Content>

From the above tag, the grouping correction unit 26 can recognize that the character strings “For participation in the training below”, “・Patent creation training”, and “・Marketing theory” must be grouped into a single group. This allows the grouping correction unit 26 to correct the determination process TB 54 shown in FIG. 18 into an appropriate TB.

FIG. 19 is a diagram illustrating an example of the determination process TB after grouping correction. On the basis of the input supplementary data 12, the grouping correction unit 26 corrects the grouping as in the determination process TB 55 shown in FIG. 19.

For example, in the determination process TB 54 of FIG. 18, the character strings “For participation in the training below”, “・Patent creation training”, and “・Marketing theory” were grouped into separate groups. In the determination process TB 55 of FIG. 19, by contrast, the grouping correction unit 26 has corrected these character strings into a single group.
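The merge driven by the supplementary data tag can be sketched as below. Several points are assumptions made for illustration: the tag is taken to be written as `<Content>...</Content>` with line breaks encoded as the two-character sequence `\n`, each group is a dictionary with `text`, `start`, and `end` fields, all coordinates are invented, and the merged area is taken as the bounding box of its members.

```python
import re

def lines_from_tag(tag_text, tag="Content"):
    """Extract the lines of one tagged entry; a literal backslash-n separates lines."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", tag_text, re.S)
    return match.group(1).split("\\n") if match else []

def merge_groups(groups, lines):
    """Merge the groups whose text appears in lines into one group."""
    members = [g for g in groups if g["text"] in lines]
    if len(members) < 2:
        return list(groups)
    merged = {
        "text": "\n".join(g["text"] for g in members),
        "start": (min(g["start"][0] for g in members), min(g["start"][1] for g in members)),
        "end": (max(g["end"][0] for g in members), max(g["end"][1] for g in members)),
    }
    out, inserted = [], False
    for g in groups:
        if g in members:
            if not inserted:          # the merged group takes the place of the first member
                out.append(merged)
                inserted = True
        else:
            out.append(g)
    return out

tag = r"<Content>For participation in the training below\n・Patent creation training\n・Marketing theory</Content>"
groups = [
    {"text": "Reason:", "start": (60, 10), "end": (70, 30)},
    {"text": "For participation in the training below", "start": (60, 35), "end": (70, 120)},
    {"text": "・Patent creation training", "start": (70, 35), "end": (80, 100)},
    {"text": "・Marketing theory", "start": (80, 35), "end": (90, 90)},
]
merged_groups = merge_groups(groups, lines_from_tag(tag))
```

Matching here is by exact text; in practice the tolerance for OCR errors discussed for character identification would probably be needed as well.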

Returning to the description of FIG. 4.

[Step S7] Next, the generation unit 27 uses the various TBs generated in the storage unit 28 to generate the digitized form 13, the form template 14, and the template information 15.

For example, the generation unit 27 generates the digitized form 13 from the data converted by the data conversion unit 21. The digitized form 13 is generated, for example, as a document file or an image file.

The generation unit 27 also generates, from the data converted by the data conversion unit 21, a form template 14 in which, for example, the characters and ruled lines of the form 11 can be edited. In doing so, the generation unit 27 deletes the character strings present in the entry areas of the form template 14 being created, on the basis of the determination results (entry/non-entry) in the determination process TB. The form template 14 is generated, for example, as a document file in which characters and ruled lines can be edited.
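Clearing the entry areas when building the template can be sketched as follows. The row layout (`text`, `result`, `start`, `end`) and the coordinates are hypothetical; the point is only that text is kept for non-entry areas and blanked for entry areas, while the area itself is preserved.

```python
def make_template(rows):
    """Return template rows: entry areas keep their position but lose their text."""
    return [
        {**row, "text": "" if row["result"] == "entry" else row["text"]}
        for row in rows
    ]

# Hypothetical rows after grouping correction.
rows = [
    {"text": "Application name", "result": "non-entry", "start": (10, 10), "end": (20, 50)},
    {"text": "Travel expense application", "result": "entry", "start": (10, 60), "end": (20, 120)},
]
template_rows = make_template(rows)
```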

The generation unit 27 also generates the template information 15 of the form template 14. The template information 15 is, for example, a file containing format information such as the character strings written in the form template 14, their positions and formats, the page margins, the paper size, and the ruled-line format.

FIG. 20 is a diagram illustrating an example of the digitized form. The digitized form 13 of FIG. 20 is an example obtained by digitizing the form 52 of FIG. 16. The dotted-line frame 56 shown in FIG. 20 indicates an entry area in which a user can enter characters. The user may be allowed to edit the characters in the entry area.

FIG. 21 is a diagram illustrating an example of the form template. The form template 14 of FIG. 21 is an example obtained by turning the form of FIG. 16 into a template. The dotted-line frame 57 shown in FIG. 21 indicates an entry area in which a user can enter characters. Within the dotted-line frame 57, the user can input characters using, for example, a terminal device.

FIG. 22 shows an example of template information. The template information 15 in FIG. 22 is the template information for the form template 14 of FIG. 21.

The template information 15 includes, for example, fixed characters, positions, character formats (font type and size), page margins, paper size, paper orientation, ruled-line positions, ruled-line formats, and area information.
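
As a rough illustration of how such a file might be organized, the sketch below serializes the listed items to JSON. The key names, units, and all values are assumptions made for the example; the file format of the template information 15 is not specified in this description.

```python
import json

# Hypothetical layout for template information; every key and value is illustrative.
template_info = {
    "paper": {
        "size": "A4",
        "orientation": "portrait",
        "margins_mm": {"top": 20, "bottom": 20, "left": 15, "right": 15},
    },
    "fixed_characters": [
        {"text": "Name", "position": {"x": 10, "y": 10},
         "font": {"type": "Gothic", "size_pt": 10.5}},
    ],
    "ruled_lines": [
        {"start": {"x": 5, "y": 5}, "end": {"x": 200, "y": 5},
         "format": {"style": "solid", "width_pt": 0.5}},
    ],
    "areas": [
        {"kind": "non_entry", "x": 10, "y": 10, "width": 40, "height": 8},
        {"kind": "entry", "x": 60, "y": 10, "width": 120, "height": 8},
    ],
}

# Serialize so the information can be stored alongside the form template.
serialized = json.dumps(template_info, ensure_ascii=False, indent=2)
```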

In this way, the information processing apparatus 1 identifies, from the electronic data of a plurality of forms 11, the fixed characters pre-printed on the form 11 and the entry characters entered by users. Based on the identified fixed characters and entry characters, it then identifies the entry areas in which a user can enter characters on the form 11 and the non-entry areas in which the user cannot. Finally, the information processing apparatus 1 generates a form template 14 that includes the identified entry areas and non-entry areas. As a result, the information processing apparatus 1 can generate a form template 14 in which the areas of pre-printed fixed characters and the entry areas in which the user can enter characters are separated with high accuracy.

Because the information processing apparatus 1 generates the form template 14 from a plurality of forms 11, it also saves the user time and effort.

For example, the information processing apparatus 1 can generate the form template 14 from a plurality of forms 11 even when no layout-dependent format information for the form 11, prepared by the user, is input. More specifically, the form template 14 can be generated without the user having to prepare the various kinds of format information for the form 11, such as the input character type ("hiragana", "numerals", and so on) or the number of input characters.

Because the information processing apparatus 1 generates the form template 14, the template can be used as a master copy of the form, for example on a terminal device. Because the information processing apparatus 1 also generates the template information 15, that information can be used together with the form template 14 as a resource in system development.

The character identification unit 24 of the information processing apparatus 1 identifies a character string as fixed characters when the same character string appears at the same position in a predetermined number of forms of the same type. This allows the information processing apparatus 1 to generate a form template 14 in which the areas where the user enters characters and the other areas are separated with high accuracy. In particular, the larger the predetermined number, the more accurate the generated form template 14.
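
The counting rule can be sketched as follows. This is a minimal illustration, assuming each recognized string is represented as a hypothetical `(text, x, y)` tuple and that positions match exactly; the function name and the `min_count` parameter (the "predetermined number") are introduced here for the example.

```python
from collections import Counter

def identify_fixed(forms, min_count):
    """forms: list of forms, each a list of (text, x, y) tuples.

    A (text, x, y) triple that occurs in at least min_count forms
    is treated as fixed characters.
    """
    counts = Counter()
    for form in forms:
        counts.update(set(form))  # count each triple at most once per form
    return {key for key, n in counts.items() if n >= min_count}

# Three illustrative forms of the same type.
forms = [
    [("Name", 10, 10), ("Taro Yamada", 60, 10)],
    [("Name", 10, 10), ("Hanako Sato", 60, 10)],
    [("Name", 10, 10), ("Jiro Suzuki", 60, 10)],
]
fixed = identify_fixed(forms, min_count=3)
```

With a small `min_count`, two users who happen to write the same value at the same position would be mistaken for pre-printed text, which is one way to see why a larger predetermined number gives a more accurate result.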

Furthermore, when supplementary data 12 containing the items pre-printed on the form 11 and the entries the user wrote for those items is input, the information processing apparatus 1 can generate a form template 14 in which the areas where the user enters characters and the other areas are separated with even higher accuracy.

The supplementary data 12 contains the items pre-printed on the form 11 and the entries the user wrote for those items, and contains no layout-dependent information. The user can therefore create the supplementary data 12 in a simple format, and the information processing apparatus 1 can generate, without burdening the user, a form template 14 in which the areas where the user enters characters and the other areas are separated with high accuracy.
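
A minimal sketch of how layout-free supplementary data could drive the identification is shown below. The mapping from item to entry, the function name, and the "unknown" fallback are assumptions for illustration; the supplementary data 12 is only described as holding items and their corresponding entries.

```python
def classify_with_supplement(strings, supplement):
    """strings: character strings recognized on one form.
    supplement: mapping of pre-printed item -> entry written by the user.
    """
    items = set(supplement.keys())
    entries = set(supplement.values())
    result = {}
    for s in strings:
        if s in items:
            result[s] = "fixed"      # matches a pre-printed item
        elif s in entries:
            result[s] = "entry"      # matches something the user wrote
        else:
            result[s] = "unknown"    # left to the other identification methods
    return result

# Illustrative supplementary data: no coordinates or format information.
supplement = {"Name": "Taro Yamada", "Address": "Tokyo"}
labels = classify_with_supplement(
    ["Name", "Taro Yamada", "Address", "Tokyo", "Application Form"],
    supplement,
)
```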

The information processing apparatus 1 further has a learning information TB 47 that stores probabilities for the character strings of the form 11 that the character identification unit 24 has identified as fixed characters, and the character identification unit 24 identifies the fixed characters written on the form 11 using the probabilities stored in the learning information TB 47. This allows the information processing apparatus 1 to generate, without burdening the user, a form template 14 in which the areas where the user enters characters and the other areas are separated with high accuracy.
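
One plausible use of the stored probabilities, in line with the comparison against a predetermined value stated in the claims, is a simple threshold test. In this sketch the dictionary standing in for the learning information TB 47, its values, and the threshold of 0.8 are all assumptions.

```python
def is_fixed_by_learning(text, learned_probability, threshold=0.8):
    """Return True/False when the string has been learned, None otherwise.

    learned_probability: mapping of character string -> fixed-character probability.
    threshold: hypothetical "predetermined value".
    """
    p = learned_probability.get(text)
    if p is None:
        return None          # never learned: fall back to other methods
    return p >= threshold

# Illustrative learned values.
learned = {"Name": 0.95, "Tokyo": 0.10}
```

Returning `None` for unseen strings keeps "not learned" distinct from "learned as an entry", so the caller can fall back to comparison across forms or to supplementary data.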

By combining character identification based on a plurality of forms 11, character identification based on the learning information TB 47, and character identification based on the supplementary data 12, the information processing apparatus 1 can generate an even more accurate form template 14.

Furthermore, when fixed characters and entry characters that have been grouped by a predetermined method belong to a single group, the information processing apparatus 1 divides the group so that they belong to separate groups. This allows the information processing apparatus 1 to generate a form template 14 in which the areas where the user enters characters and the other areas are separated with high accuracy.
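
The group division can be sketched as below. Groups are modeled as plain lists of strings and the set of fixed strings is given; both are simplifications for illustration, and the function name is hypothetical.

```python
def split_mixed_groups(groups, fixed):
    """groups: list of groups, each a list of character strings.
    fixed: set of strings already identified as fixed characters.

    A group holding both fixed and entry strings is divided in two;
    other groups are kept unchanged.
    """
    result = []
    for group in groups:
        fixed_part = [s for s in group if s in fixed]
        entry_part = [s for s in group if s not in fixed]
        if fixed_part and entry_part:
            result.extend([fixed_part, entry_part])
        else:
            result.append(group)
    return result

# "Name" and the written name were grouped together, e.g. because they sit close to each other.
groups = [["Name", "Taro Yamada"], ["Address"]]
divided = split_mixed_groups(groups, fixed={"Name", "Address"})
```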

In the above description the form 11 is a paper medium. However, the paper form 11 may be converted into electronic data in advance by an OCR function and an image identification processing function, and the electronic data may then be input to the information processing apparatus 1. In that case, the data conversion unit 21 need not have the OCR function or the image identification processing function.

Each TB described above is merely an example and is not limited to what is illustrated. The information also need not be held in a TB structure; other data structures may be used.

The "same position" referred to above also includes positions that are slightly shifted. For example, character strings written at positions that differ within a predetermined range across a plurality of forms 11 may be judged to be at the same position.
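
Such a tolerance could be expressed as a small comparison helper. The per-axis test and the default tolerance of 3 units are assumptions; the description only states that a predetermined range is allowed.

```python
def same_position(a, b, tolerance=3):
    """a, b: (x, y) positions of two character strings.

    True when both coordinates differ by at most `tolerance`
    (a hypothetical "predetermined range").
    """
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance
```

A tolerance of this kind absorbs small offsets such as those introduced when paper forms are scanned slightly out of alignment.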

The processing units of the flowchart shown in FIG. 4 are divided according to the main processing content so that the processing of the information processing apparatus 1 is easy to understand. The present invention is not limited by how the processing units are divided or by their names. The processing of the information processing apparatus 1 may be divided into a larger number of processing units depending on the processing content, or divided so that a single processing unit includes more processing. The processing order of the flowchart is likewise not limited to the illustrated example.

In the above description, the information processing apparatus 1 generates the electronic form 13, the form template 14, and the template information 15 from a plurality of forms 11. When supplementary data 12 is input, however, it can generate the electronic form 13, the form template 14, and the template information 15 from a single form 11, because the supplementary data 12 contains information on both the fixed characters and the entry characters. Likewise, when learning has already been performed for all the character strings of a given form, the information processing apparatus 1 can identify whether each character string is fixed characters or entry characters even without data from a plurality of forms 11. For example, suppose the information processing apparatus 1 converts a single form 11 into data, and every character string of the converted form 11 has been learned in the past. In this case, the information processing apparatus 1 can identify, from that one converted form 11, whether each character string is fixed characters or entry characters.

Although the learning information TB 47 described with reference to FIG. 11 stores fixed-character probabilities, it may instead store entry-character probabilities. That is, the character identification unit 24 may obtain the probability that a character string is entry characters.

Other arbitrary components or steps may also be added to the present invention.

1: information processing apparatus, 11: form, 12: supplementary data, 13: electronic form, 14: form template, 15: template information, 21: data conversion unit, 22: input unit, 23: character correction unit, 24: character identification unit, 25: area identification unit, 26: grouping correction unit, 27: generation unit, 28: storage unit, 31: CPU, 32: RAM, 33: ROM, 34: HDD, 35: scanner, 36: drive, 37: communication interface, 38: bus, 41, 43, 44a to 44c, 45, 46, 48, 49: character string processing TB, 42: area processing TB, 47: learning information TB, 50, 51, 53 to 55: determination processing TB, 52: form, 52a to 52c: points, 56, 57: dotted frames

Claims

An information processing apparatus comprising:
a character identification unit that identifies, from image information of a plurality of forms, fixed characters pre-printed on the form and entry characters entered by a user;
an area identification unit that identifies, based on the fixed characters and the entry characters identified by the character identification unit, an entry area in which a user can enter characters in the form and a non-entry area in which the user cannot enter characters in the form; and
a generation unit that generates a template of the form based on the entry area and the non-entry area identified by the area identification unit.

The information processing apparatus according to claim 1,
wherein the character identification unit identifies a character string as fixed characters when the same character string is included at the same position in a predetermined number of forms of the same type.

The information processing apparatus according to claim 1,
wherein the area identification unit identifies the area of a character string of the form identified as fixed characters by the character identification unit as a non-entry area, and identifies the area of a character string of the form identified as entry characters by the character identification unit as an entry area.

The information processing apparatus according to any one of claims 1 to 3,
wherein the character identification unit identifies the fixed characters and the entry characters written on the form using item information that includes an item pre-printed on the form and an entry entered by the user for that item.

The information processing apparatus according to claim 4,
wherein the character identification unit identifies a character string of the form corresponding to the item of the item information as fixed characters, and identifies a character string of the form corresponding to the entry of the item information as entry characters.

The information processing apparatus according to any one of claims 1 to 5,
further comprising a storage unit that stores a learning result relating to the identification of character strings of the form identified as fixed characters by the character identification unit,
wherein the character identification unit identifies the fixed characters written on the form using the learning result stored in the storage unit.

The information processing apparatus according to claim 6,
wherein the learning result is a probability or score of a character string of the form identified as fixed characters by the character identification unit, and
the character identification unit identifies a character string written on the form as fixed characters when the probability or score of that character string is equal to or greater than a predetermined value.

The information processing apparatus according to claim 6,
wherein the learning result is a probability or score of a character string of the form identified as fixed characters by the character identification unit, and
the character identification unit identifies the fixed characters and the entry characters written on the form using one or both of the item information and the probability or score stored in the storage unit.

The information processing apparatus according to any one of claims 1 to 8,
further comprising a grouping correction unit that, when the image information of the form is grouped based on a predetermined method and fixed characters and entry characters belong to one group, divides the group so that the fixed characters and the entry characters belong to different groups.

A template generation method for an information processing apparatus, comprising:
a character identification step of identifying, from image information of a plurality of forms, fixed characters pre-printed on the form and entry characters entered by a user;
an area identification step of identifying, based on the fixed characters and the entry characters identified in the character identification step, an entry area in which a user can enter characters in the form and a non-entry area in which the user cannot enter characters in the form; and
a generation step of generating a template of the form based on the entry area and the non-entry area identified in the area identification step.

A program for an information processing apparatus that generates a template of a form,
the program causing the information processing apparatus to execute:
a character identification step of identifying, from image information of a plurality of forms, fixed characters pre-printed on the form and entry characters entered by a user;
an area identification step of identifying, based on the fixed characters and the entry characters identified in the character identification step, an entry area in which a user can enter characters in the form and a non-entry area in which the user cannot enter characters in the form; and
a generation step of generating a template of the form based on the entry area and the non-entry area identified in the area identification step.