JP7041103B2

JP7041103B2 - Structured document creation device and its method

Info

Publication number: JP7041103B2
Application number: JP2019155289A
Authority: JP
Inventors: 優太松尾; 水萌勝野; 央樹松岡; 健一郎国末
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2022-03-23
Anticipated expiration: 2039-08-28
Also published as: JP2021033804A

Description

The present invention relates to a structured document creation device that creates a structured document from an unstructured document, and to a method thereof.

Techniques exist for extracting information from knowledge sources and documents such as manuals. Techniques also exist for applying machine-learning-based information processing to the extracted information to generate answers to questions. For machine learning to be effective, the information must be structured (tagged).

However, documents such as product manuals are often unstructured documents (untagged PDFs). Information processing that uses machine learning therefore requires a preprocessing step that structures the unstructured document.

With regard to file formats, optical character recognition (OCR) exists as a technique for creating a structured document from an unstructured document. However, OCR only reproduces the layout; it cannot obtain section information that expresses the structure of the document, such as chapters and sections.

A structured document creation device that creates structured documents is disclosed in, for example, Patent Document 1.

Japanese Unexamined Patent Publication No. 2010-186325

However, the technique disclosed in Patent Document 1 limits the documents to be structured to business reports. Documents that do not conform to the business report format cannot be structured. In other words, the conventional technique has low versatility.

The present invention has been made in view of this problem, and its object is to provide a highly versatile structured document creation device, and a method thereof, that can handle any kind of document as long as it is a sectioned unstructured document. Sectioned unstructured documents are, for example, manuals, instruction manuals, and specifications. An unstructured document that is not sectioned is, for example, an essay.

The structured document creation device of the present invention creates, from a sectioned unstructured document, a structured document described so as to conform to a predetermined format. The device comprises: a document structure acquisition unit that acquires text frame elements, table frame elements, and image frame elements for each page of the unstructured document; a management list generation unit that generates a management list in which the position on the page of each text frame element, table frame element, and image frame element is listed in association with its content; a text frame element arrangement unit that refers to the management list and arranges the text frame elements in writing order for each page; a text information structuring unit that generates a first structured document by dividing the arranged text frame elements based on rules set in a division definition property; and an HTML conversion unit that refers to the management list and generates an HTML file in which the table frame elements and the image frame elements are inserted into the first structured document. The rules include font size, character position, and a rule that identifies headings.

The structured document creation method of the present invention is a method performed by the above structured document creation device. The method comprises: a document structure acquisition step of acquiring text frame elements, table frame elements, and image frame elements for each page of the unstructured document; a management list generation step of generating a management list in which the position on the page of each text frame element, table frame element, and image frame element is listed in association with its content; a text frame element arrangement step of referring to the management list and arranging the text frame elements in writing order for each page; a text information structuring step of generating a first structured document by dividing the arranged text frame elements based on rules set in a division definition property; and an HTML conversion step of referring to the management list and generating an HTML file in which the table frame elements and the image frame elements are inserted into the first structured document. The rules include font size, character position, and a rule that identifies headings.

According to the present invention, it is possible to provide a highly versatile structured document creation device, and a method thereof, that can handle any kind of document as long as it is a sectioned unstructured document.

FIG. 1 is a diagram showing an example of the functional configuration of the structured document creation device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the processing procedure of the structured document creation device shown in FIG. 1.
FIG. 3 is a diagram schematically showing one page of a sectioned unstructured document.
FIG. 4 is a diagram showing an example of part of the text frame elements acquired by the document structure acquisition unit shown in FIG. 1.
FIG. 5 is a diagram showing an example of the text management list generated by the management list generation unit shown in FIG. 1.
FIG. 6 is a diagram showing an example of the image management list generated by the management list generation unit shown in FIG. 1.
FIG. 7 is a diagram showing an example of the table management list generated by the management list generation unit shown in FIG. 1.
FIG. 8 is a diagram showing an example of the division rules set in the division definition property shown in FIG. 1.
FIG. 9 is a diagram showing an example of the first structured document generated by the text information structuring unit shown in FIG. 1.
FIG. 10 is a diagram showing an example of the functional configuration of the structured document creation device according to the second embodiment of the present invention.
FIG. 11 is a flowchart showing the processing procedure of the structured document creation device shown in FIG. 10.
FIG. 12 is a diagram schematically showing the processing performed by the division unit shown in FIG. 10.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

[First Embodiment]
FIG. 1 is a diagram showing an example of the functional configuration of the structured document creation device according to the first embodiment of the present invention. The structured document creation device 100 shown in FIG. 1 creates, from a sectioned unstructured document, a structured document described so as to conform to a predetermined format.

The structured document creation device 100 includes a PDF generation unit 10, a document structure acquisition unit 20, a management list generation unit 30, a text frame element arrangement unit 40, a text information structuring unit 50, an HTML conversion unit 60, an operation unit 70, and a display unit 80. The structured document creation device 100 can be realized by, for example, a computer comprising a ROM, a RAM, a CPU, and the like. When each functional component is realized by a computer, the processing of the functions that each functional component should have is described by a program. The same applies to the other embodiments described later.

Note that FIG. 1 omits the communication unit that communicates with other devices via a network (not shown). Descriptions of these general functional components, namely the communication unit (not shown), the operation unit 70, and the display unit 80, are omitted.

The PDF generation unit 10 consists of, for example, document-authoring application software, and generates the document created by the user in the PDF (Portable Document Format) file structure. The PDF file structure consists of a page tree and an outline tree. The page tree contains the content of each page and thumbnails (reduced-size previews). The outline tree contains bookmark information and the like that indicates the chapters and sections of the document.

FIG. 2 is a flowchart showing the processing procedure of the structured document creation device 100. The operation of the structured document creation device 100 is described below with reference to FIGS. 1 and 2.

The document structure acquisition unit 20 acquires text frame elements, table frame elements, and image frame elements for each page of the sectioned unstructured document (step S1). A sectioned unstructured document is a document whose organization into chapters, sections, and the like is evident.

FIG. 3 is a diagram schematically showing one page of a sectioned unstructured document targeted by the present embodiment. As shown in FIG. 3, one page P of the unstructured document contains a title T, a subtitle ST, body text C, a figure Z, and a table H. A page P may also consist of body text C alone (without a figure Z or the like).

In FIG. 3, the text frame elements are the title T, the subtitle ST, and the body text C. The table frame element is the table H, and the image frame element is the figure Z. Each is called a frame element because each occupies a predetermined range (coordinates) on the page P.

FIG. 4 shows part of the text frame elements acquired by the document structure acquisition unit 20. The text frame elements are expressed in XML (eXtensible Markup Language) format.

The lines shown in FIG. 4 each describe a character (glyph) of the document. Lines 1 and 2 describe the character "こ", lines 3 and 4 the character "の", and lines 5 and 6 the character "た". Line 7 onward is omitted.

In FIG. 4, ahp:l is the left position of the character, ahp:r its right position, ahp:t its top position, and ahp:b its bottom position, each expressed as a coordinate on the page P.

The management list generation unit 30 generates a management list in which the position (coordinates) on the page P of each text frame element, table frame element, and image frame element is listed in association with its content (step S2). For a text frame element, the character string is obtained by using the order in which the characters appear in the XML. The position of the character string is likewise obtained as the left position ahp:l through the bottom position ahp:b.

The text frame element "このたびは、" shown in FIG. 4 has the left position ahp:l=62.32, the right position ahp:r=124.12, the top position ahp:t=739.70, and the bottom position ahp:b=749.70. The management list generation unit 30 lists the position of each text frame element in association with its content.
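The way per-character boxes combine into one text frame element can be sketched as follows. This is a minimal illustration, not the patented implementation: the `GlyphBox` type, the `merge_glyphs` function, and the individual per-character coordinates are hypothetical. Only the resulting string and its overall box (62.32, 124.12, 739.70, 749.70) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphBox:
    """One character and its box on the page (hypothetical record type)."""
    char: str
    left: float    # corresponds to ahp:l
    right: float   # corresponds to ahp:r
    top: float     # corresponds to ahp:t
    bottom: float  # corresponds to ahp:b


def merge_glyphs(glyphs):
    """Join characters in order of appearance and take the enclosing box."""
    text = "".join(g.char for g in glyphs)
    left = min(g.left for g in glyphs)
    right = max(g.right for g in glyphs)
    top = min(g.top for g in glyphs)
    bottom = max(g.bottom for g in glyphs)
    return text, left, right, top, bottom


# Per-character coordinates below are invented for illustration;
# only the overall box matches the values given in the description.
glyphs = [
    GlyphBox("こ", 62.32, 72.62, 739.70, 749.70),
    GlyphBox("の", 72.62, 82.92, 739.70, 749.70),
    GlyphBox("た", 82.92, 93.22, 739.70, 749.70),
    GlyphBox("び", 93.22, 103.52, 739.70, 749.70),
    GlyphBox("は", 103.52, 113.82, 739.70, 749.70),
    GlyphBox("、", 113.82, 124.12, 739.70, 749.70),
]

print(merge_glyphs(glyphs))
```

Taking the minimum and maximum of the character edges is one simple way to obtain the box of the whole string; the description itself only states that the string position is obtained from ahp:l through ahp:b.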

FIG. 5 is a diagram showing an example of the text management list 31, in which text frame elements are listed. The text management list 31 consists of, for example, the following items: character (character string), character string length, page, left position, top position, right position, bottom position, operation No., and writing order. The operation No. and the writing order are information that indicates the order in which the characters appear.

For the figures and tables on the page P as well, the management list generation unit 30 associates each position with its content, in the same way as for text frame elements, and generates the image management list 32 and the table management list 33.

FIG. 6 is a diagram showing an example of the image management list 32, in which image frame elements are listed. The image management list 32 consists of the items image, page, left position, top position, right position, and bottom position. The image item consists of an image file name and an image file path.

FIG. 7 is a diagram showing an example of the table management list 33, in which table frame elements are listed. The table management list 33 consists of the items table ID, page, left position, top position, right position, and bottom position.

By referring to the text management list 31, the image management list 32, and the table management list 33, all of the content on the page P can be reproduced.
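The three management lists can be pictured as simple record types that share a page number and a bounding box. The sketch below is an assumption about one possible in-memory layout; the class names, field names, file name, and table ID are hypothetical, and only the list items themselves come from the description.

```python
from dataclasses import dataclass


@dataclass
class FrameBox:
    """Page number and bounding box shared by every list entry."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


@dataclass
class TextEntry(FrameBox):
    text: str
    length: int
    operation_no: int
    writing_order: int


@dataclass
class ImageEntry(FrameBox):
    file_name: str
    file_path: str


@dataclass
class TableEntry(FrameBox):
    table_id: str


def on_page(entries, page):
    """Return the entries of any management list that belong to one page."""
    return [e for e in entries if e.page == page]


text_list = [
    TextEntry(page=1, left=62.32, top=739.70, right=124.12, bottom=749.70,
              text="このたびは、", length=6, operation_no=1, writing_order=1),
]
image_list = [
    # File name, path, and coordinates are hypothetical.
    ImageEntry(page=2, left=50.0, top=100.0, right=300.0, bottom=250.0,
               file_name="fig1.png", file_path="images/fig1.png"),
]
table_list = [
    TableEntry(page=1, left=50.0, top=400.0, right=500.0, bottom=600.0,
               table_id="table-1"),
]

print(on_page(text_list, 1), on_page(image_list, 1), on_page(table_list, 1))
```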

The text frame element arrangement unit 40 refers to the text management list 31 and rearranges the text frame elements for each page (step S3). For example, as shown in FIG. 3, it arranges the text frame elements in the order title T, subtitle ST, body text C.
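Step S3 can be sketched as a sort over the text management list. The description does not state the sort key, so this example assumes horizontal writing in which a smaller top coordinate means higher on the page and ties are resolved from left to right. The `TextEntry` type and all coordinates are hypothetical.

```python
from typing import NamedTuple


class TextEntry(NamedTuple):
    """Minimal stand-in for one row of the text management list."""
    text: str
    page: int
    left: float
    top: float


def writing_order(entries):
    """Order entries by page, then top to bottom, then left to right (assumed)."""
    return sorted(entries, key=lambda e: (e.page, e.top, e.left))


entries = [
    TextEntry("body text C", page=1, left=10.0, top=120.0),
    TextEntry("next page title", page=2, left=10.0, top=40.0),
    TextEntry("title T", page=1, left=10.0, top=40.0),
    TextEntry("subtitle ST", page=1, left=10.0, top=80.0),
]

for entry in writing_order(entries):
    print(entry.page, entry.text)
```

A vertical or multi-column layout would need a different key, which is one reason the ordering rule is kept in a single small function here.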

The text information structuring unit 50 generates a first structured document by structuring the text frame elements arranged by the text frame element arrangement unit 40 based on any of the rules set in the division definition property 51: font size, character position, a regular expression representing a predetermined rule, or bookmark information (step S4). The division definition property 51 is information that the user sets in advance. It defines rules such as arranging the text frame information based on the bookmark information, or distinguishing the title T from the subtitle ST according to font size. A specific example of a regular expression is given later.

FIG. 8 is a diagram showing an example of the rules set in the division definition property 51. As shown in FIG. 8, rules are set such as: the smaller the x and y coordinates on the page P, the higher the section level; and the larger the font size, the higher the section level. A section is a delimited part of the document, and the levels are chapter > section > subsection.

The rule "character position is (10, 5) ⇒ assign <h1>" means that the heading tag <h1> is assigned to a character string located at the coordinates (10, 5). The coordinates (10, 5) are an example. The heading tag <h1> is assigned, for example, to a chapter title.

The rule "font size is 14p ⇒ assign <h2>" means that the heading tag <h2> is assigned to a character string whose font size is 14p. The rule "starts with (number) ⇒ assign <h3>" means that the heading tag <h3> is assigned when a line begins with a number. Heading tags are available up to <h6>. The size of the character string rendered by the heading tags follows the relation <h1> > <h2> > <h3> > ....

FIG. 9 is a diagram showing an example of the first structured document generated by the text information structuring unit 50. The element enclosed between the start tag <h1> and the end tag </h1> of the heading tag is "第３節光コラボレーション受付センタについて" ("Section 3: About the Optical Collaboration Reception Center"). The rule "character position is (10, 5) ⇒ assign <h1>", for example, is applied to this element, which becomes the top-level heading of the page P.

The second line shown in FIG. 9 shows that the rule "font size is 14p ⇒ assign <h2>" set in the division definition property 51 was applied, so that "１. 光コラボレーション受付センタとは" ("1. What is the Optical Collaboration Reception Center") was structured by enclosing it between the start tag <h2> and the end tag </h2>.

The third line shows that the rule "starts with (number) ⇒ assign <h3>" was applied, so that "（１）概要" ("(1) Overview") was structured by enclosing it between the start tag <h3> and the end tag </h3>. In lines 4 to 7, the body text C of the text frame elements is structured by enclosing it in <p> tags.
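The rule application described for FIG. 8 and FIG. 9 can be sketched as a small rule table evaluated in a fixed order. This is an illustrative reading, not the patented implementation: the function names, the rule priority, the coordinates other than (10, 5), the font sizes other than 14, and the body text are assumptions. The heading strings are the ones quoted above. The "(number)" rule is interpreted here as a number in parentheses at the start of the line.

```python
import re
from html import escape

H1_POSITION = (10, 5)   # example coordinates from the description
H2_FONT_SIZE = 14       # "14p" in the description
# Assumed reading of "(number)": a parenthesised number at line start.
NUMBERED = re.compile(r"^[（(]\d+[）)]")


def tag_for(text, left, top, font_size):
    """Pick a tag by checking position, then font size, then the pattern."""
    if (left, top) == H1_POSITION:
        return "h1"
    if font_size == H2_FONT_SIZE:
        return "h2"
    if NUMBERED.match(text):
        return "h3"
    return "p"


def structure(elements):
    """Wrap each (text, left, top, font_size) element in its tag."""
    lines = []
    for text, left, top, font_size in elements:
        tag = tag_for(text, left, top, font_size)
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


elements = [
    ("第３節光コラボレーション受付センタについて", 10, 5, 18),
    ("１. 光コラボレーション受付センタとは", 10, 40, 14),
    ("（１）概要", 10, 60, 10.5),
    ("Body text (hypothetical).", 10, 80, 10.5),
]

first_structured_document = structure(elements)
print(first_structured_document)
```

Checking the rules in a fixed priority keeps the outcome deterministic when an element satisfies more than one rule; the description does not say how such conflicts are resolved.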

In this way, the text information structuring unit 50 structures the text frame elements, arranged with reference to the management list, based on the rules set in the division definition property. A document of any kind can therefore be structured based on the rules.

The rules shown so far are based on character position and font size, but other rules are possible. A regular expression representing a predetermined rule, or a rule based on the bookmark information contained in the unstructured document, may also be set in the division definition property.

A regular expression representing a predetermined rule is, for example, a fixed rule that identifies headings. Examples include section level 1: (I-IX) and section level 2: (1-9).

Section level 1: (I-IX) is a regular expression that structures Roman numerals enclosed in parentheses, such as (I), (II), ..., (X), as section level 1 (for example, a chapter). Section level 2: (1-9) is a regular expression that structures Arabic numerals enclosed in parentheses, such as (1), (2), ..., (10), as section level 2 (for example, a section).
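One way to express these two heading rules as actual regular expressions is sketched below. The concrete patterns are assumptions, since the description gives only the informal forms (I-IX) and (1-9). The patterns accept both half-width and full-width parentheses and numerals, and `section_level` is a hypothetical helper name.

```python
import re

# Assumed concrete patterns for the informal rules in the description.
LEVEL_PATTERNS = [
    (1, re.compile(r"^[（(][IVXＩＶＸ]+[）)]")),  # parenthesised Roman numeral
    (2, re.compile(r"^[（(]\d+[）)]")),           # parenthesised Arabic numeral
]


def section_level(line):
    """Return the section level of a heading line, or None for other lines."""
    for level, pattern in LEVEL_PATTERNS:
        if pattern.match(line):
            return level
    return None


for line in ["（ＩＩ）Overview", "(10) Details", "Ordinary sentence."]:
    print(line, "->", section_level(line))
```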

The text frame elements arranged with reference to the management list may also be structured based on the bookmark information contained in the outline tree of the PDF file structure of the unstructured document.

The HTML conversion unit 60 generates an HTML file in which the figure Z and the table H are inserted into the first structured document generated by the text information structuring unit 50 (step S5). The figure Z is inserted into the first structured document with reference to the image management list 32 (FIG. 6). The table H is inserted into the first structured document with reference to the table management list 33 (FIG. 7).
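Step S5 can be sketched as a merge of three position-tagged sequences for a single page. The description says only that the management lists are referred to, so placing each figure and table by its top coordinate is an assumption. The `build_html` name, the sample coordinates, and the empty `<table>` placeholder (the table management list holds only an ID and a position) are also hypothetical.

```python
from html import escape


def build_html(text_blocks, images, tables):
    """Merge one page's text, images, and tables by top coordinate (assumed).

    text_blocks: list of (top, html_fragment) from the first structured document
    images:      list of (top, image_file_path) from the image management list
    tables:      list of (top, table_id) from the table management list
    """
    items = list(text_blocks)
    items += [(top, f'<img src="{escape(path)}">') for top, path in images]
    items += [(top, f'<table id="{escape(tid)}"></table>') for top, tid in tables]
    items.sort(key=lambda item: item[0])  # smaller top means earlier in the output
    body = "\n".join(fragment for _, fragment in items)
    return f"<html><body>\n{body}\n</body></html>"


page_html = build_html(
    text_blocks=[(40.0, "<h1>Title</h1>"), (200.0, "<p>Body</p>")],
    images=[(120.0, "images/fig1.png")],
    tables=[(300.0, "table-1")],
)
print(page_html)
```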

As described above, the structured document creation device 100 according to the present embodiment creates, from a sectioned unstructured document, a structured document described so as to conform to a predetermined format. It includes: the document structure acquisition unit 20, which acquires text frame elements, table frame elements, and image frame elements for each page of the unstructured document; the management list generation unit 30, which generates a management list in which the position on the page of each text frame element, table frame element, and image frame element is listed in association with its content; the text frame element arrangement unit 40, which refers to the management list and arranges the text frame elements for each page; and the text information structuring unit 50, which generates a first structured document by structuring the arranged text frame elements based on the rules set in the division definition property 51. This makes it possible to provide a highly versatile structured document creation device, and a method thereof, that can handle any kind of document as long as it is a sectioned unstructured document.

[Second Embodiment]
FIG. 10 is a diagram showing an example of the functional configuration of the structured document creation device according to the second embodiment of the present invention. The structured document creation device 200 shown in FIG. 10 differs from the structured document creation device 100 (FIG. 1) in that it includes a division unit 210 and an HTML conversion unit 260. FIG. 11 is a flowchart showing the processing procedure of the structured document creation device 200.

The division unit 210 vectorizes each of the sentences contained in the text constituting the first structured document generated by the text information structuring unit 50, obtains the similarity between adjacent sentences, compares that similarity with a threshold set in the division definition property 51, and generates a second structured document in which the text is divided (step S6).

The sentences contained in the text are vectorized using, for example, the well-known BoW (bag-of-words) model. BoW performs morphological analysis on a sentence, assigns a unique numerical value to each of the resulting words, and converts every word into a one-hot vector. A one-hot vector is a vector in which the words present in the text are set to 1 and all others are set to 0. BoW is well known, so further description is omitted.

The division unit 210 obtains the similarity between each pair of adjacent vectorized sentences. It then compares the similarity with the threshold set in the division definition property 51 and generates a second structured document in which the text constituting the first structured document is divided. The threshold for dividing the text is, for example, 0.2. In this case, when the similarity is smaller than 0.2, the two adjacent sentences are judged to be weakly related, and the text is divided at that point and structured.
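The vectorization and the adjacent-sentence similarity can be sketched as follows. Two simplifications are assumed: whitespace splitting stands in for morphological analysis, and cosine similarity is used because the description does not name the similarity measure. All function names and sample sentences are hypothetical.

```python
import math


def bow_vectors(sentences):
    """Build binary bag-of-words vectors over a shared vocabulary."""
    tokenized = [s.lower().split() for s in sentences]  # stand-in for morphological analysis
    vocab = {}
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))  # assign each word a unique index
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocab)
        for token in tokens:
            vector[vocab[token]] = 1  # 1 if the word occurs in the sentence
        vectors.append(vector)
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def adjacent_similarities(sentences):
    """Similarity between each sentence and the one that follows it."""
    vectors = bow_vectors(sentences)
    return [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]


sentences = [
    "press the power button",
    "press the reset button",
    "contact customer support",
]
print(adjacent_similarities(sentences))
```

For Japanese text, the whitespace split would have to be replaced by a real morphological analyzer, which this sketch deliberately leaves out.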

FIG. 12 is a diagram schematically showing how the text is divided. As shown in FIG. 12, the text constituting the first structured document consists of, for example, five sentences. The similarity between sentence a and the next sentence b is 0.88, between sentences b and c is 0.75, between sentences c and d is 0.11, and between sentences d and e is 0.97.

In this case, the division unit 210 generates a second structured document that is divided and structured between sentences c and d. A second structured document in which the text is divided between weakly related sentences in this way makes the meaning of each resulting group of text clear.
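The splitting decision for the FIG. 12 example can be sketched directly from the four similarity values and the example threshold of 0.2. The function name and the list-of-lists output are hypothetical; the similarities are taken as given, however they were computed.

```python
def split_by_similarity(sentences, similarities, threshold=0.2):
    """Start a new group wherever adjacent similarity falls below the threshold.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("need exactly one similarity per adjacent sentence pair")
    groups = [[sentences[0]]]
    for sentence, similarity in zip(sentences[1:], similarities):
        if similarity < threshold:
            groups.append([sentence])    # weakly related: split here
        else:
            groups[-1].append(sentence)  # related: keep in the same group
    return groups


groups = split_by_similarity(
    ["a", "b", "c", "d", "e"],
    [0.88, 0.75, 0.11, 0.97],
)
print(groups)
```

With these inputs, only the c/d boundary (0.11) is below 0.2, so the five sentences fall into the two groups a-b-c and d-e, matching the division described above.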

As described above, the structured document creation device 200 according to the present embodiment includes the division unit 210 and the HTML conversion unit 260. The division unit 210 vectorizes each sentence of the divided text constituting the first structured document, obtains the similarity between adjacent sentences, compares that similarity with the threshold set in the division definition property 51, and, when the similarity is smaller than the threshold, generates a second structured document in which the text is divided. The HTML conversion unit 260 refers to the image management list 32 and the table management list 33 and generates an HTML file in which images and tables are inserted into the second structured document. This allows the text to be divided into appropriate lengths.

(Modification)
In addition to dividing the first structured document, the texts constituting the first structured document may also be combined. The threshold for combining texts is, for example, 0.8. In this case, sentences whose similarity is greater than 0.8 are regarded as belonging to the same paragraph and are combined and structured.

The modification includes a division/combination unit 211 (not shown) and an HTML conversion unit. The division/combination unit 211 either divides or combines text. For division, it vectorizes each sentence of the divided text constituting the first structured document, obtains the similarity between adjacent sentences, compares that similarity with a first threshold set in the division definition property, and, when the similarity is smaller than the first threshold, generates a second structured document in which the text is divided. For combination, it takes two adjacent texts, vectorizes the last sentence of the preceding text and the first sentence of the following text, obtains their similarity, compares that similarity with a second threshold set in the division definition property, and, when the similarity is greater than the second threshold, generates a second structured document in which the two divided texts are combined. The HTML conversion unit refers to the image management list and the table management list and generates an HTML file in which images and tables are inserted into the second structured document. This allows the text to be divided into appropriate lengths.
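The combining side of the modification can be sketched in the same style as the split. The function name and data layout are hypothetical; each boundary similarity is assumed to have been computed already between the last sentence of one text and the first sentence of the next, and 0.8 is the example second threshold.

```python
def merge_by_similarity(blocks, boundary_similarities, threshold=0.8):
    """Combine adjacent text blocks whose boundary similarity exceeds the threshold.

    blocks: list of text blocks, each a list of sentences
    boundary_similarities[i]: similarity between the last sentence of blocks[i]
        and the first sentence of blocks[i + 1]
    """
    if not blocks:
        return []
    if len(boundary_similarities) != len(blocks) - 1:
        raise ValueError("need exactly one similarity per adjacent block pair")
    merged = [list(blocks[0])]
    for block, similarity in zip(blocks[1:], boundary_similarities):
        if similarity > threshold:
            merged[-1].extend(block)     # strongly related: same paragraph
        else:
            merged.append(list(block))   # keep as a separate text
    return merged


merged = merge_by_similarity(
    [["a", "b"], ["c"], ["d"]],
    [0.9, 0.3],  # hypothetical boundary similarities
)
print(merged)
```

Using separate thresholds for splitting and combining leaves a middle band of similarities in which the existing boundaries are simply kept.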

(Evaluation results)
To evaluate how much the structured document creation device 100 according to the present embodiment improves the efficiency of creating structured documents, it was compared with creating structured documents manually. Two sets of documents were structured: Company E's manual, consisting of 327 pages in 29 files, and Company F's manual, consisting of 649 pages in 4 files.

As shown in Table 1, the work time required to create a structured document was reduced to about 1/6.

Next, Table 2 shows the results of evaluating how text division by the structured document creation device 200 according to the present embodiment affects search accuracy.

As shown in Table 2, the present embodiment makes it possible to speed up the work of searching and answering. When a structured document created by the structured document creation device 200 according to the present embodiment is used in, for example, a customer consultation office, the response time to customers is shortened and customer satisfaction is improved.

As described above, the structured document creation device 100 of the present embodiment provides a highly versatile structured document creation device, and a method thereof, that can handle any type of document as long as it is a partitioned unstructured document. Further, the structured document creation device 200 of the present embodiment can clarify the meaning of each structured group of text.

Although the above embodiments have been described using an example that includes the PDF generation unit 10, the PDF generation unit 10 may be omitted. The partitioned unstructured document may be input directly to the structured document creation devices 100 and 200. Thus, the present invention is not limited to the embodiments described above, and various modifications are possible within the scope of its gist.

10: PDF generation unit
20: Document structure acquisition unit
30: Management list generation unit
40: Text frame element arrangement unit
50: Text information structuring unit
51: Division definition property
60, 260: HTML conversion unit
210: Division unit
211: Division/combination unit

Claims

A structured document creation device that creates, from a partitioned unstructured document, a structured document described in conformity with a predetermined format, the device comprising:
a document structure acquisition unit that acquires text frame elements, table frame elements, and image frame elements for each page of the unstructured document;
a management list generation unit that generates a management list in which the positions on the page of the text frame elements, the table frame elements, and the image frame elements are listed in association with their contents;
a text frame element arrangement unit that arranges the text frame elements in writing order for each page with reference to the management list;
a text information structuring unit that generates a first structured document in which the arranged text frame elements are divided based on rules set in a division definition property; and
an HTML conversion unit that, with reference to the management list, generates an HTML file in which the table frame elements and the image frame elements are inserted into the first structured document,
wherein the rules include rules for identifying font size, character position, and headings, and
the text information structuring unit adds, to the text frame elements, heading tags and paragraph tags at a level corresponding to the character position and the font size.
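As an illustration of the arrangement and tagging described in this claim, the following Python sketch sorts text frames into writing order and assigns heading or paragraph tags from font size and character position. The `TextFrame` fields, the top-to-bottom and left-to-right ordering, and the concrete rule values in `HEADING_RULES` are hypothetical; the actual contents of the division definition property are not given here.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextFrame:
    page: int
    x: float          # left edge of the frame (character position)
    y: float          # top edge, assumed to grow downward
    font_size: float
    text: str


# Hypothetical rules standing in for the division definition property:
# (minimum font size, maximum left indent, tag to assign)
HEADING_RULES = [
    (18.0, 60.0, "h1"),
    (14.0, 80.0, "h2"),
]


def reading_order(frames: list[TextFrame]) -> list[TextFrame]:
    """Arrange frames page by page, top to bottom, then left to right (assumed)."""
    return sorted(frames, key=lambda f: (f.page, f.y, f.x))


def tag_for(frame: TextFrame) -> str:
    """Pick the first heading rule the frame satisfies, else a paragraph tag."""
    for min_size, max_x, tag in HEADING_RULES:
        if frame.font_size >= min_size and frame.x <= max_x:
            return tag
    return "p"


def structure(frames: list[TextFrame]) -> str:
    """Emit one tagged line per frame, in writing order."""
    lines = []
    for frame in reading_order(frames):
        tag = tag_for(frame)
        lines.append(f"<{tag}>{escape(frame.text)}</{tag}>")
    return "\n".join(lines)
```

Vertically written pages would need a different sort key; the rule table is deliberately kept as data so that such per-document differences stay outside the code.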

The structured document creation device according to claim 1, further comprising a division unit that vectorizes each sentence of the divided text constituting the first structured document, obtains the similarity between adjacent sentences, compares the similarity with a threshold value set in the division definition property, and, when the similarity is smaller than the threshold value, generates a second structured document in which the text is divided,
wherein the HTML conversion unit also generates, for the second structured document and with reference to the management list, an HTML file in which the table frame elements and the image frame elements are inserted.

A structured document creation method performed by a structured document creation device, the method comprising:
a document structure acquisition step of acquiring text frame elements, table frame elements, and image frame elements for each page of a partitioned unstructured document;
a management list generation step of generating a management list in which the positions on the page of the text frame elements, the table frame elements, and the image frame elements are listed in association with their contents;
a text frame element arrangement step of arranging the text frame elements in writing order for each page with reference to the management list;
a text information structuring step of generating a first structured document in which the arranged text frame elements are divided based on rules set in a division definition property; and
an HTML conversion step of generating, with reference to the management list, an HTML file in which the table frame elements and the image frame elements are inserted into the first structured document,
wherein the rules include rules for identifying font size, character position, and headings, and
the text information structuring step adds, to the text frame elements, heading tags and paragraph tags at a level corresponding to the character position and the font size.
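The HTML conversion step can be pictured with the short Python sketch below, which merges already-tagged text with image and table entries by their position on the page. The `Entry` record, its fields, and the flat `<img>` and `<table>` output are assumptions for illustration; the actual layout of the management list is not specified here.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Entry:
    page: int
    y: float         # vertical position on the page, assumed to grow downward
    kind: str        # "text", "image", or "table"
    content: object  # tagged HTML string, image file name, or list of rows


def render(entry: Entry) -> str:
    """Turn one management-list entry into an HTML fragment."""
    if entry.kind == "image":
        return f'<img src="{escape(str(entry.content), quote=True)}">'
    if entry.kind == "table":
        rows = "".join(
            "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
            for row in entry.content
        )
        return f"<table>{rows}</table>"
    return str(entry.content)  # text is assumed to be tagged already


def to_html(text_entries, image_list, table_list) -> str:
    """Insert images and tables among the text according to page and position."""
    merged = sorted(
        [*text_entries, *image_list, *table_list], key=lambda e: (e.page, e.y)
    )
    body = "\n".join(render(e) for e in merged)
    return f"<html><body>\n{body}\n</body></html>"
```

Sorting on `(page, y)` is the simplest placement policy; a multi-column layout would need the horizontal position as well.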

The structured document creation method according to claim 3, further comprising a division step of vectorizing each sentence of the divided text constituting the first structured document, obtaining the similarity between adjacent sentences, comparing the similarity with a threshold value set in the division definition property, and, when the similarity is smaller than the threshold value, generating a second structured document in which the text is divided,
wherein the HTML conversion step also generates, for the second structured document and with reference to the management list, an HTML file in which the table frame elements and the image frame elements are inserted.