JP7365845B2 - Learning devices, learning methods, and programs

Info

Publication number: JP7365845B2
Application number: JP2019189458A
Authority: JP
Inventors: 美恵大串; 貴広馬場; 陽太 ▲高▼岡; 英雄寺田
Original assignee: Open Stream Inc
Current assignee: Open Stream Inc
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2023-10-20
Anticipated expiration: 2039-10-16
Also published as: JP2021064260A

Description

The present invention relates to a learning device, a learning method, and a program.

There are techniques for extracting the character information in a scanned image created by reading a document such as a form with a scanner or the like (see, for example, Patent Documents 1 and 2). Patent Document 1 discloses a technique that makes errors in character information easier to correct by structuring the characters in an image based on their positions. Patent Document 1 describes structuring as grouping character information into sets of related information and identifying and expressing the hierarchical relationships among those sets. For example, when character information such as a title, a document creator, and a document creation date is extracted from an image, the structured data shows the title at the top level of the hierarchy and the document creator and the document creation date in the level below it. The technique of Patent Document 2 extracts the character information in an image together with feature information indicating the features of ruled lines. As a result, a document search can specify the features of the ruled lines in a document in addition to the characters written in it, which makes the search efficient.

When a form is digitized, its layout is often changed. The paper on which a form is printed and the screen of an electronic device such as a smartphone have different aspect ratios. If a paper form is displayed on the screen of such a device with its layout unchanged, part of the document may not fit on the screen at some display scales, or, if the entire image is displayed, it may be reduced so much that the characters become hard to read. When the layout is changed, the content of the form before conversion must be reflected in the form after conversion with nothing missing and nothing added. One conceivable approach is to change the layout of the form by applying the techniques of Patent Documents 1 and 2. With those techniques, the layout can be changed while maintaining the structure of the characters and the features of the ruled lines on the form.

Patent Document 1: JP 2019-82814 A
Patent Document 2: JP 2008-40834 A

However, even if the layout is changed while the structure of the characters and the features of the ruled lines are maintained, the content of the form before conversion cannot be reflected in the form after conversion without excess or omission. Many forms contain entry frames in which required items are to be filled in. Most such entry frames are drawn as simple rectangles that contain no characters, so no character information can be extracted from the entry frame itself. With the technique of Patent Document 1, it is therefore difficult to determine the hierarchical structure of every item on a form that includes rectangles such as entry frames. Furthermore, even if the technique of Patent Document 2 preserves the features of the original ruled lines in the form after the layout change, the layout cannot be converted appropriately unless it is known which of the areas delimited by the ruled lines should contain characters and which should be left empty as entry frames. Thus, simply applying the conventional techniques as they are makes it difficult to change the layout while maintaining the semantic connections (structure) among the items on the form, including rectangles.

The present invention has been made in view of these circumstances and provides a learning device, a learning method, and a program capable of generating a trained model that estimates the information needed to convert the layout of a document that includes rectangles.

To solve the problem described above, the present invention is a learning device comprising: an area information acquisition unit that acquires area information indicating the respective areas of the characters and rectangles included in a learning image; a structure information acquisition unit that acquires structure information indicating the hierarchical structure of the rectangles included in the learning image; a learning dataset generation unit that, based on the area information and the structure information, generates a learning dataset in which information about a rectangle of interest among the rectangles included in the learning image is the input data and the hierarchical structure of the rectangle of interest is the teacher data; and a trained model generation unit that generates, as the result of training a learning model with the learning dataset, a trained model that outputs the structure information of the rectangles included in an input image.

Further, in the learning device described above, the learning dataset generation unit uses, as the input data, the area information of each of the following: the rectangle of interest, the characters located within a predetermined first range from the position of the rectangle of interest, and the rectangles located within a predetermined second range from the position of the rectangle of interest.

Further, the learning device described above further comprises a semantic tag information generation unit that generates, based on the area information, semantic tag information including a specific second character corresponding to a first character shown in a character area included in the learning image, and the learning dataset generation unit uses, for each character indicated in the area information, the semantic tag information of that character as input data.

The present invention is also a learning method in which: an area information acquisition unit acquires area information indicating the respective areas of the characters and rectangles included in a learning image; a structure information acquisition unit acquires structure information indicating the hierarchical structure of the rectangles included in the learning image; a learning dataset generation unit, based on the area information and the structure information, generates a learning dataset in which a plurality of pieces of information about a rectangle of interest selected from the rectangles included in the learning image are the input data and the hierarchical structure of the rectangle of interest is the teacher data; and a trained model generation unit generates, as the result of training a learning model with the learning dataset, a trained model that outputs the structure information of the rectangles included in an input image.

The present invention is also a program for causing a computer to operate as the learning device described above, the program causing the computer to function as each unit of the learning device.

According to the present invention, it is possible to generate a trained model that estimates the information needed to convert the layout of a document that includes rectangles.

FIG. 1 is a block diagram showing a configuration example of the learning device 10 according to the embodiment.
FIG. 2 is a diagram illustrating area data according to the embodiment.
FIG. 3 is a diagram illustrating structured data according to the embodiment.
FIG. 4 is a diagram showing a configuration example of the conversion table 170 according to the embodiment.
FIG. 5 is a diagram showing a configuration example of the rectangle information 171 according to the embodiment.
FIG. 6 is a diagram showing a configuration example of the semantic tag information 172 according to the embodiment.
FIG. 7 is a diagram showing a configuration example of the learning dataset 173 according to the embodiment.
FIG. 8 is a diagram illustrating the processing performed by the learning device 10 according to the embodiment.
FIG. 9 is a flowchart showing the flow of the processing in which the learning device 10 according to the embodiment creates a learning dataset.
FIG. 10 is a flowchart showing the flow of the processing in which the learning device 10 according to the embodiment trains the learning model.

Embodiments of the invention are described below with reference to the drawings.

The learning device 10 generates a trained model that estimates the information needed to convert the layout of a document that includes rectangles.

The following description uses the case where the document whose layout is to be converted is a form, but the invention is not limited to this. The target of layout conversion may be any document that includes at least characters and rectangles, such as a questionnaire, a medical interview sheet, a test paper, a fixed-phrase template, or an idea sheet. A rectangle included in a document is an area enclosed in a quadrilateral shape such as an oblong or a square. Rectangles include not only areas enclosed by solid lines but also rectangular areas enclosed by dotted lines or by specific symbols or figures, and rectangular areas distinguished by the shade of the background color or the like. Characters included in a document include not only single characters but also character strings and groups of characters.

The information needed to convert the layout is information indicating the hierarchical structure of the characters and rectangles included in the form (hereinafter referred to as structured data). If the hierarchical structure of the characters and rectangles in a form is known, the layout can be converted while that structure is maintained. The characters, entry fields, and other items shown on the form therefore keep their relative positional relationships before and after the layout conversion. In other words, changing the layout while preserving what the form expresses requires extracting the structured data of the characters and rectangles included in the form.

An example of structured data follows. Consider a form that includes rectangular areas K1 to K5, as shown in FIG. 2. In this case, as shown in FIG. 3, the hierarchical structure of areas K1 to K3 has area K1 at the upper level with areas K2 and K3 subordinate to it. Structured data is information that expresses such a hierarchical structure. For example, the structured data of areas K4 and K5 is information indicating a hierarchical structure in which area K4 is at the upper level and area K5 is subordinate to it.

The following description uses the case where the learning device 10 identifies the hierarchical structure of the rectangles shown on a form. The same method can be applied to identifying the hierarchical structure of the characters shown on a form.

The following description also uses, as the hierarchical structure, the case of determining the identification information of the rectangle or character on which a rectangle in the form depends (hereinafter referred to as the parent ID). In this case, the structured data is information that associates each rectangle with its parent ID. Determining the parent ID is preferable because it uniquely identifies the structure of the rectangles while limiting the growth in data volume. However, the invention is not limited to this. Another conceivable way to identify the hierarchical structure is to determine the identification information of the rectangles or characters that depend on a rectangle (hereinafter referred to as child IDs). Because several characters or rectangles can be subordinate to a single rectangle, that approach requires a configuration in which multiple child IDs can be associated with one rectangle, which can increase the data volume. Any method may be used as long as it identifies the hierarchical structure: associating a parent ID with each rectangle, associating child IDs with each rectangle, associating both a parent ID and child IDs with each rectangle, or some other method.
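The trade-off between the parent-ID and child-ID representations can be illustrated with a minimal Python sketch. The dictionary layout and the use of `None` for a rectangle without a parent are assumptions made for illustration; the text only states that K2 and K3 are subordinate to K1 and that K5 is subordinate to K4.

```python
from collections import defaultdict

# Hypothetical structured data for areas K1 to K5: each rectangle ID maps to
# exactly one parent ID (None marks "no parent" in this sketch only).
parent_of = {"K1": None, "K2": "K1", "K3": "K1", "K4": None, "K5": "K4"}


def children_from_parents(parent_of):
    """Derive the one-to-many child-ID view from the one-to-one parent-ID view."""
    children = defaultdict(list)
    for rect_id, parent_id in parent_of.items():
        if parent_id is not None:
            children[parent_id].append(rect_id)
    return dict(children)


children_of = children_from_parents(parent_of)
print(children_of)
```

The parent-ID view stores one value per rectangle, while the derived child-ID view needs a variable-length list per rectangle, which is the data-volume concern raised above.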

The configuration of the learning device 10 is described with reference to FIG. 1. FIG. 1 is a block diagram showing a configuration example of the learning device 10 according to the embodiment. As shown in FIG. 1, the learning device 10 includes, for example, an area information acquisition unit 11, a rectangle information generation unit 12, a semantic tag information generation unit 13, a structure information acquisition unit 14, a learning dataset generation unit 15, a trained model generation unit 16, and a storage unit 17.

The area information acquisition unit 11 acquires area information indicating the areas of rectangles or characters in a learning image. A learning image is an image used to train the learning model described later. It is an image that includes rectangles and characters and for which the hierarchical structure of the rectangles is known. Area information associates pixel coordinates in the learning image with identification information indicating that the pixel at those coordinates belongs to a rectangle area or a character area. Of the acquired area information, the area information acquisition unit 11 outputs the area information indicating rectangle areas to the rectangle information generation unit 12 and the area information indicating character areas to the semantic tag information generation unit 13.

The rectangle information generation unit 12 generates rectangle information based on the area information indicating rectangle areas. Rectangle information associates coordinates indicating the position of a rectangle area with identification information indicating that the area is a rectangle. When the area is a quadrilateral, the coordinates indicating its position are, for example, the coordinates of two of its four vertices that lie on a diagonal. Alternatively, they may be the coordinates of one predetermined vertex (for example, the lower-left vertex) together with the height and width of the quadrilateral. The rectangle information generation unit 12 outputs the generated rectangle information to the structure information acquisition unit 14 and stores it in the storage unit 17 as rectangle information 171.
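The two coordinate representations described above carry the same information, as the following Python sketch suggests. The class name `RectInfo`, its field names, and the choice of the vertex with the smallest x and y values as the reference vertex are assumptions for illustration and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RectInfo:
    """Rectangle stored as an ID plus two diagonally opposite vertices."""
    rect_id: str
    x1: float
    y1: float
    x2: float
    y2: float


def from_vertex_and_size(rect_id, x, y, width, height):
    """Build the two-vertex form from one reference vertex plus width and height."""
    return RectInfo(rect_id, x, y, x + width, y + height)


def to_vertex_and_size(rect):
    """Convert back to (x, y, width, height), using the smallest x and y as the vertex."""
    x, y = min(rect.x1, rect.x2), min(rect.y1, rect.y2)
    return (x, y, abs(rect.x2 - rect.x1), abs(rect.y2 - rect.y1))


k1 = from_vertex_and_size("K1", 10, 20, 100, 30)
print(k1, to_vertex_and_size(k1))
```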

The semantic tag information generation unit 13 generates semantic tag information based on the area information indicating character areas. Semantic tag information is information to which a tag (semantic tag) corresponding to the meaning of the characters shown in an area has been attached. A semantic tag may be any information indicating that wordings are semantically equivalent. For example, a semantic tag is a wording chosen to represent a set of semantically equivalent wordings; more specifically, it is information indicating that wordings such as 「お住まい」 (residence), 「住所」 (address), 「おところ」 (place), and 「ご住所」 (address, honorific form) all mean 「住所」 (address). By generating semantic tag information, the semantic tag information generation unit 13 unifies semantically equivalent wordings into a single wording. Compared with leaving the wordings as they are, this simplifies the subsequent processing and makes it easier for the learning model described later to estimate the hierarchical structure.

The semantic tag information generation unit 13 recognizes the characters shown in a character area using an existing technique such as optical character recognition (OCR). Based on the recognition result, it generates semantic tag information by converting the recognized characters into predetermined characters set according to their meaning. For this conversion it uses the conversion table 170 (see FIG. 4). The conversion table 170 is information stored in the storage unit 17 and is a table that associates characters before conversion with characters after conversion. The pre-conversion column of the conversion table 170 lists, for example, character strings that appear frequently in forms and whose notation can vary, such as 「住所」, 「おところ」, and 「ご住所」. The post-conversion column lists a single wording set according to the meaning, for example 「住所」 (address) for 「住所」, 「おところ」, and 「ご住所」.

The semantic tag information generation unit 13 looks up the character recognition result in the conversion table 170. If the pre-conversion characters in the conversion table 170 include the same characters as those recognized, it acquires the post-conversion characters associated with them and converts the recognized characters into those post-conversion characters. It then generates semantic tag information by associating the converted characters with the area information indicating the character area. The semantic tag information generation unit 13 outputs the generated semantic tag information to the learning dataset generation unit 15 and stores it in the storage unit 17 as semantic tag information 172.

If the pre-conversion characters in the conversion table 170 do not include the same characters as those recognized, the semantic tag information generation unit 13 does not convert the characters and instead generates semantic tag information by associating the recognized characters, as they are, with the area information indicating the character area.
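The lookup-with-fallback behavior described above can be sketched in a few lines of Python. The table contents, the function name `make_semantic_tag`, and the dictionary-shaped result are illustrative assumptions; the OCR step is taken as already done, so the function receives the recognized text directly.

```python
# Hypothetical conversion table: pre-conversion wording -> post-conversion wording.
CONVERSION_TABLE = {
    "住所": "住所",
    "おところ": "住所",
    "ご住所": "住所",
}


def make_semantic_tag(area, recognized_text, table=CONVERSION_TABLE):
    """Attach a semantic tag to a character area.

    If the recognized text appears in the table, the unified wording is used;
    otherwise the recognized text is kept unchanged.
    """
    tag = table.get(recognized_text, recognized_text)
    return {"area": area, "tag": tag}


print(make_semantic_tag((5, 5, 30, 15), "ご住所"))
print(make_semantic_tag((5, 40, 30, 50), "申込書"))
```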

The structure information acquisition unit 14 acquires the structure information of the learning image. Structure information associates each rectangle in the image with its hierarchical structure (parent ID). The structure information acquisition unit 14 outputs the acquired structure information to the learning dataset generation unit 15.

The learning dataset generation unit 15 generates a learning dataset using the rectangle area data and the semantic tag information. A learning dataset is data in which input data and teacher data for training the learning model are paired as a set. The learning model is trained with the learning dataset until it can accurately output (estimate) the parent IDs of the rectangles in an input image.

The learning model is, for example, an RNN (Recurrent Neural Network). However, it is not limited to this. The learning model may instead be based on a technique such as a DCNN (Deep Convolutional Neural Network), a CNN, a decision tree, hierarchical Bayes, or an SVM (Support Vector Machine), or on an appropriate combination of these.

The learning dataset generation unit 15 extracts a rectangle of interest from the learning image. The rectangle of interest is the rectangle with which a hierarchical structure is associated as teacher data in the learning dataset. The learning dataset generation unit 15 extracts the rectangle information of the rectangles within a predetermined range (hereinafter referred to as the first range) of the rectangle of interest in the learning image (hereinafter referred to as the neighboring rectangle group). It also extracts the semantic tag information of the characters within a predetermined range (hereinafter referred to as the second range) of the rectangle of interest in the learning image (hereinafter referred to as the neighboring semantic tag group). The learning dataset generation unit 15 uses the rectangle information of the rectangle of interest, the neighboring rectangle group, and the neighboring semantic tag group as the input data for the rectangle of interest. These three items are an example of "information about the rectangle of interest." The learning dataset generation unit 15 outputs the generated learning dataset to the trained model generation unit 16 and stores it in the storage unit 17 as the learning dataset 173.
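A minimal Python sketch of assembling one learning example for a rectangle of interest follows. The source does not say how the ranges are measured, so this sketch assumes Euclidean distance between box centers; the function names, the `rect_range` and `text_range` parameters, and the dictionary layout are likewise hypothetical.

```python
import math


def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def within(box_a, box_b, radius):
    """True if the centers of the two boxes are at most `radius` apart (assumed metric)."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(ax - bx, ay - by) <= radius


def build_example(target_id, rects, tags, parent_of, rect_range, text_range):
    """Pair the input data for one rectangle of interest with its parent ID.

    rects:     {rect_id: box}
    tags:      [(box, semantic_tag), ...]
    parent_of: {rect_id: parent_id}, the known structure information
    """
    target_box = rects[target_id]
    near_rects = {
        rid: box
        for rid, box in rects.items()
        if rid != target_id and within(target_box, box, rect_range)
    }
    near_tags = [(box, tag) for box, tag in tags if within(target_box, box, text_range)]
    return {
        "input": {
            "target": (target_id, target_box),
            "near_rects": near_rects,
            "near_tags": near_tags,
        },
        "label": parent_of[target_id],
    }
```

Calling `build_example` once per rectangle in a learning image would yield one input/teacher pair per rectangle of interest.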

The trained model generation unit 16 generates a trained model. The trained model is the result of training the learning model with the learning dataset; it is a model that has learned to output the structure information of the rectangles included in an input image.

The trained model generation unit 16 repeatedly adjusts the parameters of the learning model so that the output obtained by feeding input data to the learning model approaches the teacher data associated with that input data in the learning dataset. As a result, the learning model becomes able to accurately output the structure information of the rectangles included in an input image. The trained model generation unit 16 takes the learning model that has been trained until a predetermined termination condition is satisfied as the trained model. The predetermined termination condition is, for example, that the model has been trained on all of the learning datasets created by the learning dataset generation unit 15, or that the accuracy with which it estimates the structure information of the rectangles included in an input image has reached or exceeded a predetermined threshold.
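The two termination conditions can be sketched as a generic training loop in Python. This is only an outline under stated assumptions: `update` and `accuracy` are hypothetical callables standing in for the real parameter adjustment and evaluation, the `max_passes` limit is an added generalization of "all datasets have been used", and the toy "model" below simply memorizes labels so that the loop can run on its own.

```python
def train(update, accuracy, dataset, threshold, max_passes=10):
    """Train until accuracy reaches `threshold` or the data has been used `max_passes` times."""
    for n in range(1, max_passes + 1):
        for inputs, label in dataset:
            update(inputs, label)  # move the model output toward the teacher data
        if accuracy(dataset) >= threshold:
            return ("accuracy_reached", n)
    return ("passes_exhausted", max_passes)


# Toy stand-in for a learning model: it memorizes the parent ID of each input.
memory = {}


def update(inputs, label):
    memory[inputs] = label


def accuracy(dataset):
    correct = sum(memory.get(inputs) == label for inputs, label in dataset)
    return correct / len(dataset)


toy_dataset = [("K2", "K1"), ("K3", "K1"), ("K5", "K4")]
result = train(update, accuracy, toy_dataset, threshold=0.9)
print(result)
```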

During training, the trained model generation unit 16 determines the order in which the input data is fed to the learning model. In particular, when an RNN is used as the learning model, the order of the data fed to the model itself carries information, because an RNN is configured to make estimates based on the order of its input. Specifying the order in which data is fed to the learning model is therefore expected to allow the parent ID to be estimated accurately.

The trained model generation unit 16 uses, as the input data, the rectangle of interest, the neighboring rectangle group, and the neighboring semantic tag group sorted in raster order by their representative coordinates (for example, their center coordinates). Raster order here means the order along a predetermined direction in which pixels arranged in two dimensions are read (or written). For example, raster order runs from left to right in the horizontal direction of the image and from top to bottom in the vertical direction. However, the predetermined direction may be any direction; the order may run from right to left, or from bottom to top.
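Sorting by representative coordinates in raster order can be sketched as follows in Python. The sketch assumes an image coordinate system with the origin at the top left and y increasing downward, uses box centers as the representative coordinates, and compares center y values exactly without any row tolerance; the function name and the item layout are hypothetical.

```python
def raster_sort(items):
    """Sort (label, box) items top to bottom, then left to right, by box center.

    Each box is (x1, y1, x2, y2) in image coordinates (y grows downward).
    """
    def center_key(item):
        _, (x1, y1, x2, y2) = item
        return ((y1 + y2) / 2.0, (x1 + x2) / 2.0)

    return sorted(items, key=center_key)


items = [
    ("K3", (50, 20, 100, 40)),
    ("M2", (5, 5, 30, 15)),
    ("K2", (0, 20, 50, 40)),
]
ordered = [label for label, _ in raster_sort(items)]
print(ordered)
```

A real form may need a tolerance when grouping items into rows, since boxes on the same visual line rarely share exactly the same center y value.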

記憶部１７は、変換テーブル１７０と、矩形情報１７１と、意味タグ情報１７２と、学習用データセット１７３と、学習済みモデル１７４とを記憶する。 The storage unit 17 stores a conversion table 170, rectangle information 171, meaning tag information 172, a learning data set 173, and a trained model 174.

図２は、実施形態に係る領域情報を説明する図である。図２に示すように、学習用画像から、文字の領域Ｍ１～Ｍ６、及び矩形の領域Ｋ１～Ｋ５のそれぞれの領域が抽出される。領域Ｍ１は、「申込書」の文字が示されている領域である。領域Ｍ２は、「ご住所」の文字が示されている領域である。領域Ｍ３は、「都道府県」の文字が示されている領域である。領域Ｍ４は、「お名前」の文字が示されている領域である。領域Ｍ５は、「記入日」の文字が示されている領域である。領域Ｍ６は、「年月日」の文字が示されている領域である。この例に示すように、本実施形態では、文字の領域を、矩形（四角形）の形状の領域として抽出する。 FIG. 2 is a diagram illustrating area information according to the embodiment. As shown in FIG. 2, character regions M1 to M6 and rectangular regions K1 to K5 are extracted from the learning image. Area M1 is an area where the characters "Application Form" are shown. Area M2 is an area where the characters "address" are shown. Area M3 is an area where the characters "prefecture" are shown. Area M4 is an area where the characters "name" are shown. Area M5 is an area where the characters "Date of Entry" are shown. Area M6 is an area where the characters "Year Month Day" are shown. As shown in this example, in this embodiment, a character area is extracted as a rectangular (square) shaped area.

領域Ｋ１は、領域Ｍ２を囲む矩形が示されている領域である。領域Ｋ２は、領域Ｍ３が枠内の右端に配置されるように、領域Ｍ３を囲む矩形が示されている領域である。領域Ｋ３は、領域Ｋ２の右側に配置される矩形が示されている領域である。領域Ｋ４は、領域Ｍ４を囲む矩形が示されている領域である。領域Ｋ５は、領域Ｋ４の右側に配置される矩形が示されている領域である。 Area K1 is an area showing a rectangle that surrounds area M2. Area K2 is an area showing a rectangle that surrounds area M3, with area M3 placed at the right end of the frame. Area K3 is an area showing a rectangle placed to the right of area K2. Area K4 is an area showing a rectangle that surrounds area M4. Area K5 is an area showing a rectangle placed to the right of area K4.

図３は、実施形態に係る構造情報を説明する図である。図３に示すように、領域Ｍ１＃は、文字の領域Ｍ１に示された文字が、変換テーブル１７０に基づいて変換された後の領域を示している。領域Ｍ２＃～Ｍ６＃についても同様に、文字の領域Ｍ２～Ｍ６に示された文字が、変換テーブル１７０に基づいて変換された後の領域を示している。 FIG. 3 is a diagram illustrating structure information according to the embodiment. As shown in FIG. 3, area M1# indicates an area after the characters shown in character area M1 are converted based on conversion table 170. Similarly, the regions M2# to M6# indicate the regions after the characters shown in the character regions M2 to M6 have been converted based on the conversion table 170.

図４は、実施形態に係る変換テーブル１７０の構成例を示す図である。変換テーブル１７０は、例えば、意味タグＩＤ、変換後、変換前などの各項目を備える。意味タグＩＤには、意味タグを一意に識別する識別情報が示される。変換後には変換後の文字が示される。変換前には変換前の文字列が示される。この例では、意味タグＩＤ（Ｅ０００１）に、変換後の文字として「氏名」、変換前の文字として「お名前」、「名前」、「おなまえ」が示されている。 FIG. 4 is a diagram showing a configuration example of the conversion table 170 according to the embodiment. The conversion table 170 includes items such as a semantic tag ID, "after conversion", and "before conversion". The semantic tag ID holds identification information that uniquely identifies a semantic tag. The "after conversion" item holds the converted character string, and the "before conversion" item holds the character strings before conversion. In this example, the semantic tag ID (E0001) has "氏名" (full name) as the string after conversion, and "お名前", "名前", and "おなまえ" (all variants of "name") as the strings before conversion.
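
A conversion table of this kind can be sketched as a dictionary keyed by semantic tag ID. The dictionary layout, the key names `after` and `before`, and the function name `to_semantic_tag` are assumptions; only the E0001 entry comes from the example in FIG. 4.

```python
# Conversion table keyed by semantic tag ID (layout is an assumption).
CONVERSION_TABLE = {
    "E0001": {"after": "氏名", "before": ["お名前", "名前", "おなまえ"]},
}


def to_semantic_tag(text, table=CONVERSION_TABLE):
    """Return (semantic tag ID, normalized string), or None if no entry matches."""
    for tag_id, entry in table.items():
        if text == entry["after"] or text in entry["before"]:
            return tag_id, entry["after"]
    return None


print(to_semantic_tag("お名前"))
print(to_semantic_tag("住所"))
```

Mapping several surface forms to one tag means the model sees a single token for each meaning rather than every spelling variant.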

図５は、実施形態に係る矩形情報１７１の構成例を示す図である。矩形情報１７１は、例えば、矩形ＩＤ、位置座標１、位置座標２、代表位置座標などの各項目を備える。矩形ＩＤは、学習用画像に含まれる矩形の領域を一意に識別する識別情報である。位置座標1及び位置座標２は、矩形ＩＤにより特定される矩形の領域を特定するための二点の位置座標であって、例えば、矩形の四隅に相当する四つの頂点のうち、対角線上に位置する二つの頂点の座標である。代表位置座標は、矩形ＩＤにより特定される矩形の領域の位置を代表する位置の座標であって、例えば、矩形の領域における中心座標である。代表位置座標は、学習済みモデル生成部１６により入力用データの順序が決定される際に、ラスター順にソートされる代表座標として用いられる。 FIG. 5 is a diagram illustrating a configuration example of the rectangle information 171 according to the embodiment. The rectangle information 171 includes items such as a rectangle ID, position coordinate 1, position coordinate 2, and representative position coordinate. The rectangle ID is identification information that uniquely identifies a rectangular area included in the learning image. Position coordinate 1 and position coordinate 2 are the coordinates of two points that define the rectangular area identified by the rectangle ID; for example, of the four vertices at the corners of the rectangle, they are the coordinates of the two vertices located on a diagonal. The representative position coordinate is the coordinate of a position that represents the location of the rectangular area identified by the rectangle ID, for example, the center coordinate of the rectangular area. The representative position coordinate is used as the representative coordinate that is sorted in raster order when the trained model generation unit 16 determines the order of the input data.
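
As a minimal sketch of one row of the rectangle information, the record below stores the two diagonal corners and derives the representative coordinate as their midpoint (the center). The class name `RectangleInfo`, its field names, and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class RectangleInfo:
    """One row of rectangle information (field names are assumptions)."""
    rect_id: str
    corner1: Point  # position coordinate 1: one vertex on a diagonal
    corner2: Point  # position coordinate 2: the opposite vertex

    @property
    def representative(self) -> Point:
        """Center of the rectangle: the midpoint of the diagonal."""
        (x1, y1), (x2, y2) = self.corner1, self.corner2
        return ((x1 + x2) / 2, (y1 + y2) / 2)


# Hypothetical coordinates for area K1.
k1 = RectangleInfo("K1", (10.0, 20.0), (110.0, 60.0))
print(k1.representative)
```

The midpoint is the same whichever diagonal pair is stored, so the record does not need to fix which corner comes first.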

図６は、実施形態に係る意味タグ情報１７２の構成例を示す図である。意味タグ情報１７２は、例えば、文字ＩＤ、文字、意味グループＩＤ、位置座標１、位置座標２、代表位置座標などの各項目を備える。文字ＩＤは、学習用画像に含まれる文字の領域を一意に識別する識別情報である。文字は、文字ＩＤにより特定される文字の領域において文字認識された文字が示される。意味グループＩＤには、文字が変換テーブル１７０におけるいずれの意味グループに対応するかが示される。位置座標1及び位置座標２は、文字ＩＤにより特定される文字の領域を特定するための二点の位置座標である。代表位置座標は、文字ＩＤにより特定される文字の領域の位置を代表する位置の座標である。図２に示すように、本実施形態において、文字の領域は、矩形（四角形）の形状の領域として抽出される。 FIG. 6 is a diagram illustrating a configuration example of the semantic tag information 172 according to the embodiment. The meaning tag information 172 includes items such as, for example, character ID, character, meaning group ID, position coordinate 1, position coordinate 2, and representative position coordinate. The character ID is identification information that uniquely identifies a character area included in the learning image. The characters are shown as characters recognized in the character area specified by the character ID. The meaning group ID indicates which meaning group in the conversion table 170 the character corresponds to. Position coordinate 1 and position coordinate 2 are the position coordinates of two points for specifying the area of the character specified by the character ID. The representative position coordinates are the coordinates of a position that represents the position of the character area specified by the character ID. As shown in FIG. 2, in this embodiment, the character area is extracted as a rectangular (square) shaped area.

図７は、実施形態に係る学習用データセット１７３の構成例を示す図である。学習用データセット１７３は、例えば、矩形ＩＤと、入力用データと、教師データとを備える。矩形ＩＤは、学習用画像に含まれる矩形の領域を一意に識別する識別情報である。入力用データは、矩形ＩＤにより特定される矩形を着目矩形とした場合の入力用データである。入力用データには、位置座標と、近傍文字ＩＤと、近傍矩形ＩＤとが含まれる。位置座標には、着目矩形における近傍を算出する際に基準とする位置座標が示される。近傍文字ＩＤには、着目矩形における近傍意味タグ群の文字ＩＤが示される。この例のように、近傍文字ＩＤには、複数の文字ＩＤが示されていてよい。近傍矩形ＩＤには、着目矩形における近傍矩形群の矩形ＩＤが示される。この例のように、近傍矩形ＩＤには、複数の矩形ＩＤが示されていてよい。教師データには、着目矩形における親ＩＤが示される。 FIG. 7 is a diagram showing a configuration example of the learning data set 173 according to the embodiment. The learning data set 173 includes, for example, a rectangle ID, input data, and teacher data. The rectangle ID is identification information that uniquely identifies a rectangular area included in the learning image. The input data is input data when the rectangle specified by the rectangle ID is the rectangle of interest. The input data includes position coordinates, neighboring character IDs, and neighboring rectangle IDs. The position coordinates indicate the position coordinates that are used as a reference when calculating the neighborhood of the rectangle of interest. The neighborhood character ID indicates the character ID of the neighborhood meaning tag group in the rectangle of interest. As in this example, a plurality of character IDs may be indicated in the neighboring character ID. The neighboring rectangle ID indicates the rectangle ID of a group of neighboring rectangles in the rectangle of interest. As in this example, a plurality of rectangle IDs may be indicated in the neighboring rectangle ID. The teacher data indicates the parent ID of the rectangle of interest.

図８は、実施形態に係る学習装置１０が行う処理を説明する図である。ステップＳ１において、学習用画像として用意された帳票が帳票分割器に入力され、学習用画像として用意された帳票がＯＣＲに入力される。ここでのＯＣＲは、学習装置１０の意味タグ情報生成部１３の機能の一部としての光学文字認識であることを前提とするが、これに限定されることはなく、ＯＣＲが学習装置１０の外部にある外部装置であってもよい。 FIG. 8 is a diagram illustrating processing performed by the learning device 10 according to the embodiment. In step S1, a form prepared as a learning image is input to a form divider, and the same form is input to OCR. The OCR here is assumed to be optical character recognition performed as part of the function of the semantic tag information generation unit 13 of the learning device 10, but this is not a limitation; the OCR may be an external device outside the learning device 10.

ステップＳ２において、帳票分割器は、入力された画像を矩形及び文字それぞれの領域に分割する装置であり、矩形及び文字それぞれの領域情報を出力する。帳票分割器は、入力された学習用画像における矩形の領域情報を学習装置１０に出力する。この図には示されていないが、帳票分割器は、入力された学習用画像における文字の領域情報をＯＣＲに出力するようにしてもよい。 In step S2, the form divider is a device that divides the input image into regions for rectangles and characters, and outputs region information for each rectangle and characters. The form divider outputs rectangular area information in the input learning image to the learning device 10. Although not shown in this figure, the form divider may output character area information in the input learning image to the OCR.

ステップＳ３において、学習装置１０の矩形情報生成部１２は、矩形の領域情報を用いて、矩形情報を生成する。矩形情報生成部１２は、学習用画像において生成した全ての矩形情報を、学習用データセット生成部１５に出力する。 In step S3, the rectangle information generation unit 12 of the learning device 10 generates rectangle information using the rectangle area information. The rectangle information generation unit 12 outputs all the rectangle information generated in the learning image to the learning data set generation unit 15.

ステップＳ４において、意味タグ情報生成部１３は、学習用画像における文字の領域を文字認識させ、ステップＳ５において文字認識された結果を示す情報（文字情報と記載）出力し、ステップＳ６において意味タグ情報を生成する。ステップＳ７において、意味タグ情報生成部１３は、学習用画像において生成した全ての意味タグ情報を、学習用データセット生成部１５に出力する。 In step S4, the semantic tag information generation unit 13 has character recognition performed on the character areas in the learning image; in step S5, it outputs information indicating the character recognition result (denoted as character information); and in step S6, it generates semantic tag information. In step S7, the semantic tag information generation unit 13 outputs all the semantic tag information generated for the learning image to the learning data set generation unit 15.

ステップＳ８において、学習用データセット生成部１５は、学習用データセット（学習データと記載）の入力用データを生成する。学習済みモデル生成部１６は、入力用データにおける着目矩形、近傍意味タグ群、及び近傍矩形群のそれぞれの中心点をラスター順にソートすることにより、入力用データを学習モデルに入力させる順序を決定する。 In step S8, the learning data set generation unit 15 generates the input data of the learning data set (denoted as learning data). The trained model generation unit 16 determines the order in which the input data is input to the learning model by sorting the center points of the rectangle of interest, the neighboring semantic tag group, and the neighboring rectangle group in the input data in raster order.

ステップＳ９において、学習済みモデル生成部１６は、入力用データを学習モデルに入力させる。ステップＳ１０において、学習済みモデル生成部１６は、学習モデルから得られる出力を、着目矩形の親ＩＤの予測結果として取得する。ステップＳ１１において、学習済みモデル生成部１６は、学習用データセットの教師データ、つまり着目矩形の親ＩＤを取得する。ステップＳ１２において、学習済みモデル生成部１６は、着目矩形の親ＩＤの予測結果と、学習用データセットの教師データとを用いて、損失関数を生成し、その結果を学習モデルに反映させる。 In step S9, the trained model generation unit 16 inputs the input data to the learning model. In step S10, the learned model generation unit 16 obtains the output obtained from the learning model as a prediction result of the parent ID of the rectangle of interest. In step S11, the trained model generation unit 16 acquires the teacher data of the learning data set, that is, the parent ID of the rectangle of interest. In step S12, the trained model generation unit 16 generates a loss function using the prediction result of the parent ID of the rectangle of interest and the teacher data of the learning data set, and reflects the result in the learning model.

図９は、実施形態に係る学習装置１０が行う学習用のデータセットを作成する処理の流れを説明する図である。ステップＳ２０において、学習用データセット生成部１５は、着目矩形の位置座標を取得する。ステップＳ２１において、学習用データセット生成部１５は近傍にある矩形の矩形情報を取得する。ステップＳ２２において、学習用データセット生成部１５は近傍にある文字の意味タグ情報を取得する。ステップＳ２３において、学習用データセット生成部１５は、着目矩形の親ＩＤを取得する。ステップＳ２４において、学習用データセット生成部１５は、入力用データとしての着目矩形、近傍意味タグ群、及び近傍矩形群と、教師データとしての親ＩＤを組み合わせることによって学習用のデータセットを作成する。 FIG. 9 is a diagram illustrating the flow of processing, performed by the learning device 10 according to the embodiment, for creating a learning data set. In step S20, the learning data set generation unit 15 acquires the position coordinates of the rectangle of interest. In step S21, the learning data set generation unit 15 acquires the rectangle information of nearby rectangles. In step S22, the learning data set generation unit 15 acquires the semantic tag information of nearby characters. In step S23, the learning data set generation unit 15 acquires the parent ID of the rectangle of interest. In step S24, the learning data set generation unit 15 creates the learning data set by combining the rectangle of interest, the neighboring semantic tag group, and the neighboring rectangle group as input data with the parent ID as teacher data.
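
Steps S20 to S24 can be sketched as a single function that assembles one record of the learning data set. This is a minimal sketch under stated assumptions: neighborhood is measured as Euclidean distance between representative coordinates (the embodiment only says "nearby"), and the function name `build_example`, the dictionary layout, the sample coordinates, and the parent ID `P0` are all hypothetical.

```python
import math


def build_example(target_id, rects, chars, parent_of, rect_range, char_range):
    """Assemble one record: input data plus the parent ID as teacher data.

    rects / chars map an ID to its representative (x, y) coordinate.
    """
    # S20: position coordinates of the rectangle of interest.
    tx, ty = rects[target_id]

    def near(points, limit, skip=None):
        # Euclidean distance is an assumption; any range test would do.
        return sorted(
            i for i, (x, y) in points.items()
            if i != skip and math.hypot(x - tx, y - ty) <= limit
        )

    return {
        "rect_id": target_id,
        "input": {
            "position": (tx, ty),
            "near_rect_ids": near(rects, rect_range, skip=target_id),  # S21
            "near_char_ids": near(chars, char_range),                  # S22
        },
        "teacher": parent_of[target_id],                               # S23
    }  # S24: input data and teacher data combined into one record


# Hypothetical coordinates and parent ID.
rects = {"K1": (50.0, 50.0), "K2": (150.0, 50.0), "K3": (400.0, 300.0)}
chars = {"M2": (50.0, 50.0), "M3": (170.0, 50.0)}
parent_of = {"K1": "P0"}

example = build_example("K1", rects, chars, parent_of,
                        rect_range=120.0, char_range=60.0)
print(example)
```

Separate ranges for characters and rectangles mirror the first and second ranges mentioned elsewhere in the description; they need not be equal.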

図１０は、実施形態に係る学習装置１０が行う学習の流れを説明する図である。ステップＳ３０において、学習済みモデル生成部１６は、入力用データを学習モデルに入力させる。この際、学習済みモデル生成部１６は、入力用データを学習済みモデルに入力させる順序を所定のルールに従い予め決定させておく。ステップＳ３１において、学習済みモデル生成部１６は、学習モデルによる順伝播計算を実施させ、学習モデルから出力を得る。ステップＳ３２において、学習済みモデル生成部１６は、学習モデルから得られた出力と、教師データとの誤差に基づいて損失関数を導出し、損失関数に基づいて誤差逆伝播を実施させる。ステップＳ３３において、学習済みモデル生成部１６は、損失関数に基づく誤差逆伝播により更新した学習モデルのパラメータを記憶させる。ステップＳ３４において、学習済みモデル生成部１６は、所定の終了条件を満たすか否かを判定する。所定の終了条件を満たす場合には、学習モデルに対する学習を完了させ、学習済みモデルとする。所定の終了条件を満たさない場合には、ステップＳ３０に戻り、学習を繰り返す。 FIG. 10 is a diagram illustrating the flow of learning performed by the learning device 10 according to the embodiment. In step S30, the trained model generation unit 16 inputs the input data to the learning model. At this time, the trained model generation unit 16 determines in advance the order in which the input data is input to the trained model according to a predetermined rule. In step S31, the trained model generation unit 16 performs forward propagation calculation using the learning model and obtains an output from the learning model. In step S32, the trained model generation unit 16 derives a loss function based on the error between the output obtained from the learning model and the teacher data, and performs error backpropagation based on the loss function. In step S33, the trained model generation unit 16 stores the parameters of the learning model updated by error backpropagation based on the loss function. In step S34, the trained model generation unit 16 determines whether a predetermined termination condition is satisfied. If the predetermined termination condition is met, the learning for the learning model is completed and the learning model is set as a trained model. If the predetermined termination condition is not met, the process returns to step S30 and learning is repeated.
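
The loop of steps S30 to S34 can be sketched as follows. To stay self-contained, the sketch substitutes a single-layer logistic model trained by gradient descent for the RNN of the embodiment, so "backpropagation" reduces to one gradient step. The function name `train`, the learning rate, the epoch limit, the accuracy threshold, and the toy samples are all assumptions.

```python
import math


def train(samples, lr=0.5, max_epochs=200, target_accuracy=1.0):
    """Train a stand-in logistic model; samples are (features, label in {0, 1})."""
    w, b = [0.0] * len(samples[0][0]), 0.0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        for x, y in samples:
            # S30-S31: feed the input and run the forward computation.
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # S32: gradient of the cross-entropy loss with respect to z.
            g = p - y
            # S33: keep the updated parameters.
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
        # S34: termination check, here an accuracy threshold on the samples.
        correct = sum(
            (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y)
            for x, y in samples
        )
        if correct / len(samples) >= target_accuracy:
            break
    return w, b, epoch


# Toy data: the label equals the first feature (hypothetical).
samples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
w, b, epochs_used = train(samples)
print(w, b, epochs_used)
```

The two exits from the loop correspond to the two example termination conditions in the description: exhausting the training budget, or reaching the accuracy threshold.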

以上説明した通り、実施形態の学習装置１０は、領域情報取得部１１と、構造情報取得部１４と、学習用データセット生成部１５と、学習済みモデル生成部１６とを備える。領域情報取得部１１は、学習用画像に含まれる文字と矩形とのそれぞれの領域を示す領域情報を取得する。構造情報取得部１４は、学習用画像に含まれる矩形の階層構造を示す構造情報を取得する。学習用データセット生成部１５は、領域情報及び構造情報に基づいて、学習用画像に含まれる矩形のうち着目矩形に関する情報を入力用データとし、着目矩形の階層構造を教師データとする学習用データセットを生成する。学習済みモデル生成部１６は、学習用データセットを用いて学習モデルに学習させた学習結果として、入力された画像に含まれる矩形における構造情報を出力する学習済みモデルを生成する。これにより、実施形態の学習装置１０は、矩形を含む文書のレイアウトを変換するために必要な情報、つまり、矩形の親ＩＤを推定する学習済みモデルを生成することができる。 As described above, the learning device 10 of the embodiment includes the area information acquisition unit 11, the structure information acquisition unit 14, the learning data set generation unit 15, and the trained model generation unit 16. The area information acquisition unit 11 acquires area information indicating the respective areas of the characters and rectangles included in the learning image. The structure information acquisition unit 14 acquires structure information indicating the hierarchical structure of the rectangles included in the learning image. Based on the area information and the structure information, the learning data set generation unit 15 generates a learning data set in which information about a rectangle of interest, among the rectangles included in the learning image, is the input data and the hierarchical structure of the rectangle of interest is the teacher data. The trained model generation unit 16 generates, as the result of training the learning model with the learning data set, a trained model that outputs the structure information of the rectangles included in an input image. As a result, the learning device 10 of the embodiment can generate a trained model that estimates the information needed to convert the layout of a document containing rectangles, namely the parent ID of each rectangle.

また、実施形態の学習装置１０では、学習用データセット生成部１５は、着目矩形、着目矩形の位置から所定の第１範囲内に位置する文字、及び着目矩形の位置から所定の第２範囲内に位置する矩形のそれぞれの矩形情報（「領域情報」の一例）、及び意味タグ情報（「領域情報」の一例）を前記入力用データとする。これにより、実施形態の学習装置１０は、着目矩形とその近傍にある矩形及び文字の情報に基づいて、親ＩＤを推定するように学習モデルに学習させることが可能となり、より精度よく親ＩＤを推定する学習済みモデルを生成することができる。 Furthermore, in the learning device 10 of the embodiment, the learning data set generation unit 15 uses, as the input data, the rectangle information (an example of "area information") and the semantic tag information (an example of "area information") of each of the rectangle of interest, the characters located within a predetermined first range from the position of the rectangle of interest, and the rectangles located within a predetermined second range from the position of the rectangle of interest. As a result, the learning device 10 of the embodiment can train the learning model to estimate the parent ID based on information about the rectangle of interest and the rectangles and characters in its vicinity, and can generate a trained model that estimates the parent ID more accurately.

また、実施形態の学習装置１０では、学習用データセット生成部１５は、入力用データに用いる領域情報に示される文字または矩形それぞれの位置に応じて、入力用データを学習モデルに入力させる順序を決定する。これにより、実施形態の学習装置１０は、入力用データを学習モデルに入力させる順序に情報を持たせることができ、順序を情報として捉える学習モデル、例えばＲＮＮを用いて、入力の順序を考慮した学習をさせることが可能となり、より精度よく親ＩＤを推定する学習済みモデルを生成することができる。 Further, in the learning device 10 of the embodiment, the learning data set generation unit 15 determines the order in which the input data is input to the learning model according to the position of each character or rectangle indicated in the area information used for the input data. decide. As a result, the learning device 10 of the embodiment can have information on the order in which the input data is input to the learning model, and uses a learning model that takes the order as information, such as an RNN, to consider the input order. It becomes possible to perform learning, and it is possible to generate a trained model that estimates the parent ID with higher accuracy.

また、実施形態の学習装置１０では、領域情報に基づいて、学習用画像に含まれる文字の領域に示される第１文字に対応する特定の第２文字を含む意味タグ情報を生成する意味タグ情報生成部１３を更に備え、学習用データセット生成部１５は、領域情報に示される文字について、当該文字の前記意味タグ情報を入力用データに用いる。これにより、実施形態の学習装置１０は、学習用画像に示されている文字について、その意味に応じたタグ付けを行うことができ、学習モデルへの学習を、タグ付けを行わない場合と比較して、簡単にして処理負担を軽減させることが可能である。 Furthermore, the learning device 10 of the embodiment further includes the semantic tag information generation unit 13, which generates, based on the area information, semantic tag information including a specific second character corresponding to a first character shown in a character area included in the learning image, and the learning data set generation unit 15 uses, for a character indicated in the area information, the semantic tag information of that character as the input data. As a result, the learning device 10 of the embodiment can tag the characters shown in the learning image according to their meaning, which simplifies training of the learning model and reduces the processing load compared with the case where no tagging is performed.

上述した実施形態における学習装置１０の全部または一部をコンピュータで実現するようにしてもよい。その場合、この機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することによって実現してもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含んでもよい。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよく、ＦＰＧＡ等のプログラマブルロジックデバイスを用いて実現されるものであってもよい。 All or part of the learning device 10 in the embodiment described above may be realized by a computer. In that case, a program for realizing this function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. Note that the "computer system" herein includes hardware such as an OS and peripheral devices. Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. Furthermore, a "computer-readable recording medium" refers to a storage medium that dynamically stores a program for a short period of time, such as a communication line when transmitting a program via a network such as the Internet or a communication line such as a telephone line. It may also include a device that retains a program for a certain period of time, such as a volatile memory inside a computer system that is a server or client in that case. Further, the above-mentioned program may be one for realizing a part of the above-mentioned functions, or may be one that can realize the above-mentioned functions in combination with a program already recorded in the computer system. It may also be realized using a programmable logic device such as an FPGA.

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 Although the embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments, and includes designs within the scope of the gist of the present invention.

１０…学習装置 10...Learning device
１１…領域情報取得部 11...Area information acquisition unit
１２…矩形情報生成部 12...Rectangle information generation unit
１３…意味タグ情報生成部 13...Semantic tag information generation unit
１４…構造情報取得部 14...Structure information acquisition unit
１５…学習用データセット生成部 15...Learning data set generation unit
１６…学習済みモデル生成部 16...Trained model generation unit
１７…記憶部 17...Storage unit
１７０…変換テーブル 170...Conversion table
１７１…矩形情報 171...Rectangle information
１７２…意味タグ情報 172...Semantic tag information
１７３…学習用データセット 173...Learning data set
１７４…学習済みモデル 174...Trained model

Claims

1. A learning device comprising:
an area information acquisition unit that acquires area information indicating respective areas of characters and rectangles included in a learning image;
a structural information acquisition unit that acquires structural information indicating a hierarchical structure of the rectangles included in the learning image;
a learning data set generation unit that generates, based on the area information and the structural information, a learning data set in which information regarding a rectangle of interest among the rectangles included in the learning image is used as input data and the hierarchical structure of the rectangle of interest is used as teacher data; and
a trained model generation unit that generates, as a learning result of training a learning model using the learning data set, a trained model that outputs the structural information for a rectangle included in an input image.

2. The learning device according to claim 1, wherein
the learning data set generation unit uses, as the input data, the area information of each of the rectangle of interest, characters located within a predetermined first range from the position of the rectangle of interest, and rectangles located within a predetermined second range from the position of the rectangle of interest.

3. The learning device according to claim 1 or claim 2, further comprising
a semantic tag information generation unit that generates, based on the area information, semantic tag information including a specific second character corresponding to a first character shown in a character area included in the learning image, wherein
the learning data set generation unit uses, for a character indicated in the area information, the semantic tag information of that character as the input data.

4. A learning method, wherein:
an area information acquisition unit acquires area information indicating respective areas of characters and rectangles included in a learning image;
a structural information acquisition unit acquires structural information indicating a hierarchical structure of the rectangles included in the learning image;
a learning data set generation unit generates, based on the area information and the structural information, a learning data set in which a plurality of pieces of information regarding a rectangle of interest selected from the rectangles included in the learning image are used as input data and the hierarchical structure of the rectangle of interest is used as teacher data; and
a trained model generation unit generates, as a learning result of training a learning model using the learning data set, a trained model that outputs the structural information for a rectangle included in an input image.

5. A program for causing a computer to operate as the learning device according to any one of claims 1 to 3, the program causing the computer to function as each unit included in the learning device.