JP2006053612A

JP2006053612A - Document conversion device and method, and document conversion program

Info

Publication number: JP2006053612A
Application number: JP2004232785A
Authority: JP
Inventors: Mitsuo Nunome; 光生布目; Yasuto Ishitani; 康人石谷
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-08-09
Filing date: 2004-08-09
Publication date: 2006-02-23
Anticipated expiration: 2024-08-09
Also published as: JP3962732B2

Abstract

PROBLEM TO BE SOLVED: To provide a document processor capable of properly extracting a table structure, or a portion with a table-like structure, from a document, and of generating an output document in which the extracted portion can be used as a structured document.

SOLUTION: The document processor 1 has: a knowledge dictionary part 7 that defines attribute assignment information; a conversion rule part 9 that defines output restriction information; an attribute assignment part 3 that determines whether a predetermined character string is present among the character strings appearing in a specified conversion target part and, based on the attribute assignment information of the knowledge dictionary part, assigns the attribute information corresponding to that character string; a conversion mapping generation part 4 that determines, from the attribute information assigned by the attribute assignment part, the regularity of appearance of the predetermined attribute information defined in the output restriction information, and generates a conversion mapping for the output restriction information; and an output document generation part 5 that generates and outputs the output document from the generated conversion mapping in a form that conforms to the output restriction information.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document conversion apparatus, method, and program, and more particularly to a document conversion apparatus, method, and program that extract a table structure, or a portion with a table-like structure, from a document, convert it into a structured document, and output the result.

In recent years, document structuring has attracted attention. Creating a structured document by adding tags to an electronic text document has conventionally required a person to review the contents of each input document and either write the correspondence between the input and output document structures (the so-called conversion mapping) by hand as a conversion script or conversion program, or manually add marking symbols as cues for conversion and then run a batch conversion with a script.

Documents may also contain so-called tabular portions. One technique for structuring such portions is described in JP-A-2001-325284, which makes it possible to recognize, among other things, each element of a table structure.
Patent Document 1: JP 2001-325284 A

However, the technique described in Patent Document 1 requires that the tabular portion be explicitly specified when it is input to the document processing apparatus. Documents that prioritize readability, such as home appliance manuals and pharmaceutical package inserts, describe tables and table-like structures in many different ways, so it has been difficult to use that technique to extract their contents correctly and create a structured document.

Accordingly, an object of the present invention is to provide a document processing apparatus that can appropriately extract a table structure, or a portion with a table-like structure, from a document and generate an output document in which the extracted portion can be used as a structured document.

The document processing apparatus of the present invention comprises: a knowledge dictionary unit that defines attribute assignment information for assigning attribute information to predetermined character strings; a conversion rule unit that defines output constraint information for converting a conversion target location of an input document and outputting an output document; an attribute assignment unit that determines whether a predetermined character string is present among the character strings appearing in the specified conversion target location and, based on the attribute assignment information of the knowledge dictionary unit, assigns the attribute information corresponding to that character string; a conversion mapping generation unit that determines, from the attribute information assigned by the attribute assignment unit, the regularity of appearance of the predetermined attribute information defined in the output constraint information and, based on the determined regularity, generates a conversion mapping for the output constraint information; and an output document generation unit that generates and outputs an output document from the conversion mapping information in a format that conforms to the output constraint information.

According to the document processing apparatus of the present invention, a table structure, or a portion with a table-like structure, can be appropriately extracted from a document, and an output document in which the extracted portion can be used as a structured document can be generated.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a functional block diagram showing the configuration of the document structuring apparatus according to the present embodiment. The document structuring apparatus 1 includes a conversion target location specifying unit 2, an attribute analysis unit 3, a conversion mapping generation unit 4, a conversion execution unit 5, a knowledge dictionary analysis unit 6, a knowledge dictionary unit 7, a conversion rule analysis unit 8, and a conversion rule unit 9. The document structuring apparatus 1 is, for example, a computer having a storage device that can store various programs and data. The programs and data required by each unit are stored in the storage device of the computer.

Each "unit" in this specification is a conceptual entity corresponding to a function of the embodiment and does not necessarily correspond one-to-one to specific hardware or a specific software routine. The embodiment is therefore described below in terms of virtual circuit blocks (units) that each have one of its functions. In addition, unless contrary to their nature, the steps of each procedure in the present embodiment may be executed in a changed order, several at a time, or in a different order on each execution.

The document structuring apparatus 1 can use, for example, a general-purpose computer as its basic hardware. The conversion target location specifying unit 2, attribute analysis unit 3, conversion mapping generation unit 4, conversion execution unit 5, knowledge dictionary analysis unit 6, and conversion rule analysis unit 8 can be realized by having a processor in the computer execute a program. The document structuring apparatus 1 may be realized by installing the program on the computer in advance, or by distributing the program on a removable storage medium such as a CD-ROM or over a network and installing it on the computer as appropriate. The knowledge dictionary unit 7 and the conversion rule unit 9 can be realized using, as appropriate, a storage device such as memory or a hard disk built into the computer, a storage device such as memory or a hard disk attached externally to the computer, or a removable storage medium such as a floppy (registered trademark) disk.

In the document structuring apparatus 1, the data of an input document is input to the conversion target location specifying unit 2. The input document is, for example, a document that contains a table structure or a portion with a table-like structure, as shown in FIG. 2. FIG. 2 shows an example of an input document. A table structure or table-like structure consists of rows and columns and contains cell contents, which are the elements of the table.

Hereinafter, a table structure or a portion with a table-like structure is referred to simply as a tabular portion. The following description uses the example in which the contents of each cell contain character string information.

The conversion target location specifying unit 2 accepts the input document and identifies the location or region within it that is to be structured. As shown in FIG. 2, the input document contains both tabular and non-tabular portions. For example, when the input document of FIG. 2, printed on paper or the like, is read and converted by an OCR (optical character reader), it is converted into character string data, that is, text data, as shown in FIG. 3. FIG. 3 shows an example of an input document converted into text data. Some OCR software produces character string data in a format other than that of FIG. 3, but the following description assumes that the input document has been converted into text data in the format of FIG. 3. The input document is not limited to OCR output: it may be text data created manually in advance, or data containing a table displayed in, for example, a browser and stored as text data by cut and paste.

The input format must satisfy the following conditions. In the data, enumerated in a single sequence as text, each piece of text corresponding to a table element (including empty ones) can be explicitly distinguished by line breaks, tabs, or other separator symbols. In addition, the elements of the table structure are assumed to be enumerated in a fixed order, which may be either row-wise or column-wise.

The attribute analysis unit 3 first extracts text elements from the conversion target location identified by the preceding conversion target location specifying unit 2. For plain text such as that shown in FIG. 3, a text element is one line of the text when it is split line by line. If the input is already a tagged document, for example an HTML (Hyper Text Markup Language) document with coarse-grained tags, a text element is an individual character string enclosed by tags.

The knowledge dictionary unit 7 holds, registered in advance, the keywords and other strings to be matched when the attribute analysis unit 3 assigns attribute information. The knowledge dictionary analysis unit 6 defines the meaning of the vocabulary of keywords, such as predetermined character strings, and the attribute information to be assigned to each keyword. As indicated by the dotted lines in FIG. 1, the knowledge dictionary unit 7 supplies keywords and attribute information to the attribute analysis unit 3 via the knowledge dictionary analysis unit 6, or via the knowledge dictionary analysis unit 6 and the conversion rule analysis unit 8.

The attribute analysis unit 3 refers to the knowledge dictionary unit 7 and assigns an attribute value, as attribute information, to each text element. The assigned attribute values are output to the conversion mapping generation unit 4 as the attribute analysis result, together with the corresponding original text elements. The processing performed by the attribute analysis unit 3 is described later.

The conversion rule unit 9 defines output constraint information for outputting a structured document.

The contents of the conversion rule unit 9, namely the association between the input document and the output document structure, are defined by the user in order to obtain a structured document in the format the user wants.

The output constraint information is supplied to the conversion mapping generation unit 4 via the conversion rule analysis unit 8. In the present embodiment, the conversion rule unit 9 also contains the attribute information to be extracted during attribute assignment, so the information in the conversion rule unit 9 is also supplied to the attribute analysis unit 3 via the conversion rule analysis unit 8.

The conversion mapping generation unit 4 generates a conversion mapping based on, for example, the regularity with which predetermined attribute values appear among the attribute values assigned to the text elements by the preceding attribute analysis unit 3. The processing performed by the conversion mapping generation unit 4 is described later.

The conversion execution unit 5 is an output document generation unit. Using the conversion mapping obtained in the preceding stages, it embeds the text elements obtained so far into the post-conversion structure predefined in the conversion rules, yielding the desired structured document, for example XML (Extensible Markup Language) data. The processing performed by the conversion execution unit 5 is described later.

First, the overall flow of processing in the document structuring apparatus 1 is described. FIG. 4 is a flowchart showing an example of the overall flow of the structuring process performed on an input document by the document structuring apparatus 1.

For example, a user who wants to structure a document containing a table structure inputs the document into the document structuring apparatus 1. If the document already contains electronic text data, that data is used as the input; if the original is a paper document, the input is electronic data converted from image data to text data by OCR software.

The conversion target location is identified in the input data (step 1; "step" is hereinafter abbreviated as S). S1 is executed by the conversion target location specifying unit 2. Words that begin a partial structure of the input document, for example the surface expression "Side Effects", are registered in advance in a predetermined area of the computer's storage device, and the conversion target location specifying unit 2 identifies the target location by treating the region between one such word and the next as a region corresponding to a chapter or section of the input document. Here it is assumed that, out of the whole input document, the portion following the keyword "4. Side Effects" in FIG. 3 has been obtained as the extraction target.
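The keyword-delimited region search in S1 can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `find_target_region`, the registered word list, and the rule that a heading is an optional "N." prefix followed by a registered word are all assumptions made for the example.

```python
import re

# Hypothetical words registered in advance as the start of a partial structure.
SECTION_START_WORDS = ["Indications", "Dosage", "Side Effects", "Precautions"]


def heading_word(line, start_words):
    """Return the registered word if the line is a heading such as '4. Side Effects'."""
    m = re.match(r"^\s*(?:\d+\.)?\s*(.+?)\s*$", line)
    text = m.group(1) if m else ""
    return text if text in start_words else None


def find_target_region(lines, target_word, start_words=SECTION_START_WORDS):
    """Return the lines from the target heading up to (not including) the next heading."""
    start = None
    for i, line in enumerate(lines):
        word = heading_word(line, start_words)
        if start is None:
            if word == target_word:
                start = i          # region begins at the target heading
        elif word is not None:
            return lines[start:i]  # region ends just before the next registered word
    return lines[start:] if start is not None else []
```

Calling `find_target_region(lines, "Side Effects")` on a document split into lines returns only the "Side Effects" section; if no later registered word follows, the region runs to the end of the document.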

Which region or range of the input document is used as the conversion target location may also be specified by naming a particular chapter or the like, or manually by a person.

Through this processing by the conversion target location specifying unit 2, only the location or region to be converted (hereinafter, the conversion target location) is identified in, that is, extracted from, the input document supplied by the user.

The attribute analysis unit 3 then extracts text elements from the data of the conversion target location extracted by the conversion target location specifying unit 2 (S2) and assigns an attribute value, that is, a class, to each extracted text element (S3).

Next, the conversion mapping generation unit 4 extracts the regularity with which classes appear (S4) and determines the conversion mapping (S5). Finally, the conversion execution unit 5 executes the conversion process, embedding the text elements into the predetermined post-conversion structure (S6).

Next, the processing of each unit is described in detail.
As described above, the attribute analysis unit 3 extracts text elements from the data of the conversion target location (S2) and then assigns attributes (S3). This processing is described in more detail with reference to FIG. 5.

FIG. 5 is a flowchart showing an example of the processing flow of the attribute analysis unit 3.
As shown in FIG. 5, the conversion rules are read first (S11). FIG. 6 shows an example of a conversion rule description. In the conversion rule of FIG. 6, the portion 101 enclosed by the <extraction target> tags is the extraction target attribute description part, which lists the attributes that the attribute analysis unit 3 extracts and assigns. In other words, reading the conversion rule specifies which attributes are selected for assignment. If the conversion rule does not specify any extraction targets, predetermined default attributes are selected for the analysis. In FIG. 6, four attributes are listed as extraction targets: "label", "item name", "cell contents", and "comment line".
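Reading the extraction targets in S11, with a fallback to defaults, might look like the sketch below. The actual rule syntax of FIG. 6 is not reproduced in this text, so the XML layout (`rule`, `extraction_target`, `attribute`), the function name, and the contents of `DEFAULT_TARGETS` are hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

# Assumed defaults; the source only says that predetermined defaults exist.
DEFAULT_TARGETS = ["label", "item name", "cell contents", "comment line"]


def read_extraction_targets(rule_xml):
    """Return the attributes listed in the rule, or the defaults if none are given."""
    root = ET.fromstring(rule_xml)
    section = root.find("extraction_target")  # hypothetical tag name
    if section is None:
        return list(DEFAULT_TARGETS)
    targets = [
        el.text.strip()
        for el in section.findall("attribute")
        if el.text and el.text.strip()
    ]
    return targets or list(DEFAULT_TARGETS)
```

Returning a copy of the defaults keeps callers from accidentally modifying the shared list.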

Hereinafter, the attribute of a text element is also called its classification or simply its class.

After S11, it is determined whether the conversion target location identified in the document by the preceding conversion target location specifying unit 2 is a tagged document (S12). If the conversion target location can be parsed as a tagged document (YES in S12), it is expanded into, for example, a tree structure (S13). Text elements are then extracted depth-first from the tree. For each text element, the name of its parent tag and its depth, measured from the root of the tree that represents the whole document, are also stored as attribute values extracted from the text element (S14).
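The depth-first extraction of S13 and S14 can be illustrated with Python's standard `xml.etree.ElementTree`. This sketch assumes the tagged input is well-formed XML (real-world HTML would need a more tolerant parser), and the function name and dictionary keys are illustrative choices.

```python
import xml.etree.ElementTree as ET


def extract_text_elements(markup):
    """Walk the tree depth-first, recording each text with its parent tag and depth."""
    root = ET.fromstring(markup)
    elements = []

    def visit(node, depth):
        text = (node.text or "").strip()
        if text:
            elements.append({"text": text, "parent_tag": node.tag, "depth": depth})
        for child in node:
            visit(child, depth + 1)
            # Text that follows a child's closing tag belongs to the current node.
            tail = (child.tail or "").strip()
            if tail:
                elements.append({"text": tail, "parent_tag": node.tag, "depth": depth})

    visit(root, 0)  # the root of the tree is depth 0
    return elements
```

Because the traversal visits children in document order, the resulting list also preserves the order of appearance of the text elements.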

If, on the other hand, the conversion target location has no structure and is so-called plain text, the result of S12 is NO; the text is read line by line and each line is stored as one text element (S15). After the processing so far, the text elements are stored together with their order of appearance in the document, regardless of whether the conversion target location is a tagged document or plain text.

Specifically, a text element list such as that shown in FIG. 7 is created from the text data of the input document shown in FIG. 3. FIG. 7 shows an example of a text element list. As shown in FIG. 7, each text element is given an identifier corresponding to its order of appearance, that is, a text identifier (hereinafter, text element ID). In other words, executing S12 to S15 of FIG. 5 on the input document of FIG. 3 produces the text element list of FIG. 7. Because the text element ID represents the order in which each text element appears in the input document, the conversion mapping generation unit 4 can use it, as described later, to determine in what order predetermined attribute values appear, or how a particular set of attribute values appears consecutively.
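For the plain-text branch (S15), building a text element list with IDs is a one-liner. The sketch below assumes 1-based IDs and keeps blank lines as elements, on the reading that empty table cells still occupy a position; both choices are assumptions, and the exact layout of FIG. 7 is not reproduced here.

```python
def build_text_element_list(plain_text):
    """Return (text_element_id, line) pairs in order of appearance."""
    return [(i, line) for i, line in enumerate(plain_text.splitlines(), start=1)]
```

For example, `build_text_element_list("4. Side Effects\nHypersensitivity")` yields two pairs whose IDs record the order in which the lines appeared.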

Given this result, the subsequent attribute assignment process (S16) checks, for each text element, whether it has a predetermined character string length, a predetermined character type, or contains a predetermined character string; determines the attribute value of each line, that is, of each text element; and assigns that attribute value to the text element. As described above, the keywords (such as predetermined character strings) and their corresponding attribute information are defined and stored in advance in the knowledge dictionary unit 7, for example as shown in FIG. 8.

FIG. 8 illustrates an example of the information defined in the knowledge dictionary unit 7 for extracting item names and labels. As shown in FIG. 8, each entry has a character string to the right of ";" and the attribute information corresponding to that character string to the left of ";", and the entries are registered in the knowledge dictionary unit 7 in list form.
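A dictionary in this "attribute;string" list form could be loaded as sketched below. The one-entry-per-line layout, the ASCII ";" separator, and the function name are assumptions for illustration; FIG. 8 itself is not reproduced in this text.

```python
from collections import defaultdict


def parse_knowledge_dictionary(text):
    """Map each attribute identifier (left of ';') to its list of strings (right of ';')."""
    entries = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ";" not in line:
            continue  # skip blank or malformed lines
        attribute, pattern = line.split(";", 1)
        entries[attribute.strip()].append(pattern.strip())
    return dict(entries)
```

Splitting on the first ";" only means a registered string may itself contain a semicolon.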

The attribute assignment process S16 is now described. FIG. 9 is a flowchart showing an example of the flow of the attribute assignment process S16. First, the information of the extraction target attribute description part 101 is read from the conversion rule (S21). This corresponds to reading the elements described in the region enclosed by the <extraction target> tags in FIG. 6, which in this example are "label", "item name", "cell contents", and "comment line".

Next, one line (text element) to be processed is read (S22). For each attribute, it is determined from the information of the extraction target attribute description part 101 read in S21 whether the attribute is an extraction target (S23, S25, S27, S29); if it is, it is determined whether the text element of the line read in S22 contains a character string associated with that attribute in the knowledge dictionary unit 7 shown in FIG. 8 (S24, S26, S28, S30).

First, some terms are defined. An "item name" here is a term for an element that commonly appears in tables; in the example of FIG. 2, it denotes the elements that appear in the first column, such as "hypersensitivity", "bleeding tendency", and "blood". "item" in the knowledge dictionary is an identifier assigned in advance to the words and expressions that correspond to item names in documents of a given field.

A concrete processing example follows. It is determined whether a character string corresponding to an item name (a character string associated with "item" in FIG. 8) appears in the text element of the line read in S22 (S23). In the example of FIG. 6, extraction of item names is specified, so the result of S23 is YES. The dictionary list of FIG. 8 is then consulted, and if a character string associated with the attribute information "item", for example "hypersensitivity" or "blood", is contained in the current line (text element), the attribute value "item name" is assigned to that line (S24).

Similarly, it is determined whether a character string corresponding to a label name (a character string associated with "label" in FIG. 8) appears in the text element of the line read in S22 (S25). In the example of FIG. 6, extraction of label names is specified, so the result of S25 is YES. The dictionary list of FIG. 8 is then consulted, and if a character string associated with the attribute information "label", for example "0.1 to less than 5%" or "frequency unknown", is contained in the current line (text element), the attribute value "label" is assigned to that line (S26).

Similarly, it is determined whether the text element of the line read in S22 is cell contents (S27). Cell contents are the table elements that hold data rather than labels or item names. In the example of FIG. 6, extraction of cell contents is specified, so the result of S27 is YES. If the current line (text element) does not exceed a predetermined character string length threshold (for example, 20 characters), or if it contains a specific character string such as a keyword, for example "anemia", the line is judged to be cell contents and the attribute value "cell contents" is assigned to it (S28).

Furthermore, it is determined whether the text element of the line read in S22 is a comment line (S29). In the example of FIG. 6, extraction of comment lines is specified, so the result of S29 is YES. If the current line (text element) exceeds a predetermined character string length threshold (for example, 20 characters), or if it contains a specific character string such as a keyword, for example "Note 1", the line is judged to be a comment line and the attribute value "comment line" is assigned to it (S30).

In the dictionary example referred to here, the character strings expected to appear in a line (text element) may be written so as to include regular expressions, and the determination for assigning an attribute value to each line is performed against these entries.
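
The cell-content and comment-line checks of S27 to S30 can be illustrated with a minimal Python sketch. The 20-character threshold and the keywords come from the examples in the description; the function names, the English keyword patterns, and the choice to return every matching attribute are assumptions made for illustration, since the description does not state how a line satisfying both tests is resolved.

```python
import re

LENGTH_THRESHOLD = 20  # example threshold given in the description

# Hypothetical dictionary entries; regular expressions are allowed.
CELL_PATTERNS = [re.compile(r"anemia")]
COMMENT_PATTERNS = [re.compile(r"Note\s*\d+")]


def is_cell_content(line: str) -> bool:
    """S27/S28: a short line, or a line containing a cell-content keyword."""
    return len(line) <= LENGTH_THRESHOLD or any(
        p.search(line) for p in CELL_PATTERNS
    )


def is_comment_line(line: str) -> bool:
    """S29/S30: a long line, or a line containing a comment keyword."""
    return len(line) > LENGTH_THRESHOLD or any(
        p.search(line) for p in COMMENT_PATTERNS
    )


def assign_attributes(line: str) -> list[str]:
    """Return every attribute whose test the line passes (an assumption)."""
    attrs = []
    if is_cell_content(line):
        attrs.append("cell content")
    if is_comment_line(line):
        attrs.append("comment line")
    return attrs
```

A short line that also contains a comment keyword passes both tests in this sketch, which is why a later stage (the regularity analysis) is needed to settle the final class.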

After S29 or S30, it is determined whether the above processing has been performed for every line in the conversion target location (S31). If not all lines have been processed (NO in S31), the process returns to S22. If all lines have been processed (YES in S31), each line that has not received any attribute is temporarily given the provisional attribute "other", and the text element of each line, its corresponding text ID, and the attribute value assigned by the attribute analysis unit 3 are passed to the subsequent processing (S32).

As described above, the attribute analysis unit 3 uses the information in the knowledge dictionary unit 7 to determine whether a predetermined character string is present among the character strings appearing in the conversion target location, and assigns an attribute value to each line (text element). The attribute analysis unit 3 stores this result in a storage device in the computer as the attribute value for each line (text element).

In this way, for the data of the conversion target location extracted by the conversion target location specifying unit 2, the attribute analysis unit 3 extracts the text elements and assigns an attribute value to each extracted text element, and the resulting analysis data is stored in a storage device in the computer.

When attribute values are obtained for the text elements, the morphological analysis unit 10, shown by a dotted line in FIG. 1, may be used to perform morphological analysis on each text element. In that case, the relationship between morphological analysis results and the attribute values to be assigned is prepared in advance as morphological analysis rules, and an attribute value is assigned according to whether a specific combination of parts of speech appears in the text element.

After the above processing has been performed on all text elements in the conversion target location, the processing ends, and the final analysis result is supplied to the subsequent conversion map generation unit 4.

Next, the processing in the conversion map generation unit 4 will be described.

First, the process of acquiring the regularity of class appearance (S5) in FIG. 4, mentioned above, will be described. Specifically, the classification (attribute assignment) results associated with the text elements are examined in the order in which the text elements appear, and it is determined with what regularity the text elements classified into similar classes appear. Assume here that, as a result of classification, text element IDs 3, 4, 5 and 19, 20, 21 have been classified as "label", as shown in FIG. 10. FIG. 10 shows a specific example of assigned attribute values and text element IDs.

In this example, the first "label" text element to appear in the document is ID 3, and the labels continue from there through ID 5. Because three "label" elements appear in succession, it is assumed that three "cell content" elements corresponding to these three labels appear repeatedly in the subsequent text elements, and the repeating structure is extracted.

As a result, the text elements from ID 6 onward are divided into groups of the form [[item name], [cell content]]. In terms of text element IDs, this gives [[6], [7, 8, 9]], [[10], [11, 12, 13]], [[14], [15, 16, 17]], [[18], [19, 20, 21]], and so on. The same repeating structure is assumed to continue until a text element with a different classification appears, such as a "comment line" or another text element classified as "label", or until the conversion target location ends, and the attribute value "cell content" is assigned to the text elements accordingly. To determine the end point in this example, FIG. 10, which shows the attribute values resulting from classification, is consulted: besides IDs 3 to 5, the text elements with IDs 19 to 21 have also been judged to be "label".

Therefore, the range over which the repeating structure based on the labels of IDs 3 to 5 is extracted excludes the text elements from ID 19 onward, and the last structure correctly acquired before that point ([[14], [15, 16, 17]]) is assumed to be the end of this extraction.

Next, text elements classified as "label" are searched for as the basis of the next repeating structure to extract. In the example of FIG. 10, text elements corresponding to "label" appear at IDs 19 to 21. As in the previous extraction, it is therefore assumed that a repeating structure of three cell contents corresponding to the three labels appears in the text elements following IDs 19 to 21, and the repeating structure is extracted. As for the end of this extraction, text elements with the attribute value "comment element" exist at IDs 38 and 39, so the text elements from ID 38 onward are excluded from extraction. The groups based on IDs 19 to 21 are thus finally [[22], [23, 24, 25]], [[26], [27, 28, 29]], [[30], [31, 32, 33]], and [[34], [35, 36, 37]].

Because no element corresponding to "label" exists among the remaining text elements, the "label" extraction ends.
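
The label-driven grouping walked through above can be sketched in Python as follows. This is an illustrative reading of the description, not the patented implementation: the function name, the "unclassified" placeholder attribute, and the rule that an incomplete trailing group is dropped are assumptions. The attribute table mirrors only what the text states about FIG. 10 (labels at IDs 3 to 5 and 19 to 21, comments at IDs 38 and 39).

```python
def extract_repeating_groups(attrs):
    """Group the elements after each run of labels into [item, cells] units.

    attrs maps text element ID -> attribute. A unit is kept only if it is
    complete and contains no label or comment line.
    """
    ids = sorted(attrs)
    stop = {"label", "comment line"}
    result = []
    i = 0
    while i < len(ids):
        if attrs[ids[i]] != "label":
            i += 1
            continue
        # Measure the run of consecutive labels.
        start = i
        while i < len(ids) and attrs[ids[i]] == "label":
            i += 1
        labels = ids[start:i]
        size = 1 + len(labels)  # one item name plus one cell per label
        groups = []
        while i + size <= len(ids):
            window = ids[i:i + size]
            if any(attrs[t] in stop for t in window):
                break  # a different class ends the repetition
            groups.append(([window[0]], window[1:]))
            i += size
        result.append((labels, groups))
    return result


# Attributes as stated for FIG. 10; everything else is a placeholder.
FIG10_ATTRS = {i: "unclassified" for i in range(1, 40)}
for i in (3, 4, 5, 19, 20, 21):
    FIG10_ATTRS[i] = "label"
for i in (38, 39):
    FIG10_ATTRS[i] = "comment line"

for labels, groups in extract_repeating_groups(FIG10_ATTRS):
    print(labels, groups)
```

With this input the first label run yields three groups ending at [[14], [15, 16, 17]], and the second yields four groups ending at [[34], [35, 36, 37]], matching the walkthrough.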

As described above, the conversion map generation unit 4 infers from the attributes how the appearance of text elements repeats, and fits the result to the output constraints for output. The tabular portion can therefore be structured even when information cannot easily be extracted from the input document or when the extraction result is inadequate.

Finally, the text elements corresponding to "cell content" are extracted. They can be obtained, for example, by removing from all text elements in the conversion target location those whose class has already been fixed by the processing so far. For example, they are determined by the defining expression "cell content" = (all text elements) - "label" - "item name" - "comment line" - (text elements preceding the first label). The text elements obtained from this expression are treated as "cell content". The part marked "cell content" in FIG. 10 shows this result, giving the text element IDs of the "cell content" elements.
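
The defining expression for "cell content" is a chain of set differences, which a short Python sketch makes concrete. The function name and the toy IDs in the usage note are hypothetical and are not taken from FIG. 10.

```python
def cell_content_ids(all_ids, labels, item_names, comment_lines):
    """cell content = all - label - item name - comment line - (before first label)."""
    if not labels:
        return []  # assumption: with no label there is no table to fill
    first_label = min(labels)
    before_first_label = {i for i in all_ids if i < first_label}
    remaining = (
        set(all_ids)
        - set(labels)
        - set(item_names)
        - set(comment_lines)
        - before_first_label
    )
    return sorted(remaining)
```

For a toy location with IDs 1 to 10, labels {3, 4}, item names {5, 8}, and a comment line {10}, the function returns [6, 7, 9].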

As described above, from the attribute information assigned by the attribute analysis unit 3 (in the example, "item name", "label", "cell content", and "comment line"), the conversion map generation unit 4 determines the regularity of appearance of the predetermined attribute information defined in the output constraint information (for example, "label"), and generates a conversion map for the output constraint information based on that regularity. The conversion map generation unit 4 supplies the conversion map information to the subsequent conversion execution unit 5.

As shown by dotted lines in FIG. 1, an output unit 11 such as a monitor connected to the computer and an input unit 12 such as a keyboard may be used to present the conversion map generated by the conversion map generation unit 4 to the user for confirmation and verification, and to accept user input for correcting or editing the presented conversion map.

Next, the processing procedure of the conversion execution unit 5, which serves as the output document generation unit, will be described with reference to the flowchart of FIG. 11. FIG. 11 is a flowchart showing an example of the processing flow of the conversion execution unit 5. First, the attribute analysis result, including the conversion map generated in the preceding stages, and the conversion rules are read (S31). At this point, the output constraint information 102 and 103 shown in FIG. 6 is read from the conversion rules prepared in advance.

In the output constraint information given as an example in FIG. 6, the post-conversion structure is specified inside the "<構造例>" (structure example) tag. Looking first inside element 102, enclosed by "<構造例 例="1">", there are four tags, <serialno>, <item>, <frequency>, and <description>, each of which has a text node as its child node (with the values "$count", "「項目名」" (item name), "「ラベル」" (label), and "「セル内容」" (cell content), respectively).

When a text node contains a classification result in this way, it is checked whether the class notation used in the text node matches a class notation in the attribute analysis result. If a text node contains no class notation, the post-conversion structure written in the conversion rule replaces the existing structure as a new structure. If the check finds that the class of the text node is the same as a class contained in the attribute analysis result, the text node is replaced with the corresponding text element from the attribute analysis result. If the notation "$count" appears in a text node in the conversion rule, it is replaced with a number that increases each time the structure is referenced. This processing continues as long as unprocessed text elements remain in the corresponding attribute analysis result.
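
The text-node substitution described above can be sketched as a small template filler. This is an illustration under stated assumptions: the template is represented as a list of (tag, text node) pairs, English class names stand in for the Japanese notations used in FIG. 6, and the function name is hypothetical.

```python
from xml.sax.saxutils import escape

# Stand-in for structure example 1; class names are English placeholders.
TEMPLATE_1 = [
    ("serialno", "$count"),
    ("item", "item name"),
    ("frequency", "label"),
    ("description", "cell content"),
]


def fill_template(template, element, count):
    """Replace each text node with a counter, a classified text, or itself.

    element maps class name -> text taken from the attribute analysis result.
    """
    lines = []
    for tag, node in template:
        if node == "$count":
            text = str(count)      # number that grows with each reference
        elif node in element:
            text = element[node]   # class matched in the analysis result
        else:
            text = node            # no class notation: keep the literal text
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


print(fill_template(
    TEMPLATE_1,
    {
        "item name": "Hypersensitivity",
        "label": "0.1% to less than 5%",
        "cell content": "Rash, itching, etc.",
    },
    1,
))
```

The caller would increment `count` on each call, which corresponds to "$count" increasing every time the structure is referenced.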

First, the value of the "例" (example) attribute of the "<構造例>" (structure example) tag is read (S32). It is then determined whether the "例" attribute is "1" (S33). In the case of FIG. 6, "1" is set as the "例" attribute of the "<構造例>" tag of output constraint information 102, so S33 yields YES and the process proceeds to S34. In S34, the output document is output once each time the text attribute "cell content", the last text attribute appearing in the structure example tag, appears in the document. In other words, the output document is output as many times as the text attribute "cell content" appears in the document. Here, the output document is information on the structure of the output document, and the output document produced by the conversion execution unit 5 is in a form that can be converted into a structured document.

The number of appearances of "cell content", which determines the number of outputs, does not match the number of appearances of "item name" or "label". When the output document structure is output, the conversion is therefore defined so that, for "item name", the most recently used content is applied repeatedly. Because "label" is known to repeat in sets of three, the elements of the triple are applied in turn to the "cell content" entries (including empty elements). That is, each cell content is combined with the other attribute values to form one output structure of the post-conversion structure.

For example, in FIG. 7, the text element with ID 6 is an item name, and the immediately preceding text elements with IDs 3 to 5 are consecutive labels. It is therefore assumed that the three labels of the text elements with IDs 3, 4, and 5 correspond to the horizontal direction of the table, and that the item name of the text element with ID 6 corresponds to the vertical direction. Since the attribute and text of ID 7 are the cell content "rash, itching, etc.", the following output document is generated and output in a format conforming to output constraint information 102 of structure example 1, as shown in FIG. 12. FIG. 12 shows a specific example of the final output document structure for the conversion target location.

Some attributes assigned by the attribute analysis unit in the preceding stage may not be specified in the conversion rules (in this example, text elements given the "other" attribute). When the conversion rules contain no specification for an attribute, the corresponding text elements cannot appear in the final output document and are consequently removed.

<serialno>1</serialno>
<item>Hypersensitivity</item>
<frequency>0.1% to less than 5%</frequency>
<description>Rash, itching, etc.</description> ... Output (1)

Next, because the cell contents of ID 8, ID 9, and ID 11 are empty elements, the corresponding output documents are generated in a format conforming to output constraint information 102 of structure example 1 and are then rejected without being output.

Next, because ID 12 is the cell content "subcutaneous bleeding, hematuria, etc.", the following output document is generated and output in a format conforming to output constraint information 102 of structure example 1.

<serialno>2</serialno>
<item>Bleeding tendency</item>
<frequency>less than 0.1%</frequency>
<description>Subcutaneous bleeding, hematuria, etc.</description> ... Output (2)

Similarly, because ID 13 is an empty element, the corresponding output document is generated in a format conforming to output constraint information 102 of structure example 1 and is then rejected without being output.

Similarly, because ID 15 is the cell content "anemia, etc.", the following output document is generated and output in a format conforming to output constraint information 102 of structure example 1.

<serialno>3</serialno>
<item>Blood</item>
<frequency>0.1% to less than 5%</frequency>
<description>Anemia, etc.</description> ... Output (3)

Similarly, because ID 16 and ID 17 are empty elements, the corresponding output documents are generated in a format conforming to output constraint information 102 of structure example 1 and are then rejected without being output.
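
The way outputs (1) to (3) arise, with the item name carried forward, the labels applied in turn, and empty cells rejected, can be sketched in Python. The function name and the data layout are assumptions for illustration. The first two label strings follow outputs (1) and (2); the third label is a placeholder, because none of the three outputs uses it.

```python
def build_records(labels, rows):
    """Combine each non-empty cell with the current item name and its label.

    rows lists text elements in document order as (attribute, text) pairs.
    """
    records = []
    item = None
    cell_index = 0
    for attr, text in rows:
        if attr == "item name":
            item = text          # reused for every following cell
            cell_index = 0       # assumption: label cycle restarts per item
        elif attr == "cell content":
            label = labels[cell_index % len(labels)]
            cell_index += 1
            if not text:
                continue         # empty cell: generated, then rejected
            records.append({
                "serialno": len(records) + 1,
                "item": item,
                "frequency": label,
                "description": text,
            })
    return records


LABELS = ["0.1% to less than 5%", "less than 0.1%", "third label (placeholder)"]
ROWS = [
    ("item name", "Hypersensitivity"),
    ("cell content", "Rash, itching, etc."),
    ("cell content", ""),
    ("cell content", ""),
    ("item name", "Bleeding tendency"),
    ("cell content", ""),
    ("cell content", "Subcutaneous bleeding, hematuria, etc."),
    ("cell content", ""),
    ("item name", "Blood"),
    ("cell content", "Anemia, etc."),
    ("cell content", ""),
    ("cell content", ""),
]

for record in build_records(LABELS, ROWS):
    print(record)
```

Numbering only the records that survive rejection keeps the serial numbers contiguous, which is consistent with outputs (1) to (3) being numbered 1, 2, and 3.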

In this way, the three outputs above are obtained based on the labels of IDs 3 to 5. Similarly, the following outputs are obtained based on the labels of IDs 19 to 21.

<serialno>4</serialno>
<item>Digestive system</item>
<frequency>0.1% to less than 5%</frequency>
<description>Nausea, stomach discomfort, diarrhea</description> ... Output (4)
<serialno>5</serialno>
<item>Digestive system</item>
<frequency>less than 0.1%</frequency>
<description>Vomiting, loss of appetite, constipation, stomatitis, etc.</description> ... Output (5)
<serialno>6</serialno>
<item>Liver</item>
<frequency>0.1% to less than 5%</frequency>
<description>Liver dysfunction</description> ... Output (6)
<serialno>7</serialno>
<item>Liver</item>
<frequency>Frequency unknown</frequency>
<description>Jaundice</description> ... Output (7)
<serialno>8</serialno>
<item>Kidney</item>
<frequency>less than 0.1%</frequency>
<description>BUN-creatine elevation</description> ... Output (8)
<serialno>9</serialno>
<item>Other</item>
<frequency>0.1% to less than 5%</frequency>
<description>CK (CPK) elevation</description> ... Output (9)
<serialno>10</serialno>
<item>Other</item>
<frequency>less than 0.1%</frequency>
<description>Headache, heavy-headedness, etc.</description> ... Output (10)
<serialno>11</serialno>
<item>Other</item>
<frequency>Frequency unknown</frequency>
<description>Gynecomastia</description> ... Output (11)

As described above, for output constraint information 102, in which "1" is set as the "例" (example) attribute of the "<構造例>" (structure example) tag, output documents are generated in a format conforming to that output constraint information. This processing is applied repeatedly at each place where cell content appears in the input document. As a result, the output documents are output in a format conforming to the output constraint information.

Next, it is determined whether the "例" (example) attribute is "2" (S35). In the case of FIG. 6, "2" is set as the "例" attribute of the "<構造例>" (structure example) tag of output constraint information 103, so S35 yields YES and the process proceeds to S36.

In S36, looking inside element 103, enclosed by "<構造例 例="2">", there are two tags, <serialno> and <注意事項> (precautions), each of which has a text node as its child element (with the values "$count" and "「コメント行」" (comment line), respectively).

Accordingly, for output constraint information 103, in which "2" is set as the "例" (example) attribute of the "<構造例>" (structure example) tag, an output document is generated in a format conforming to that output constraint information. Specifically, as long as the last text attribute written in the structure example, namely the comment line, appears in the input document, the output constraint information is applied as a template for the structure example, and the text elements are used to output an output document conforming to the output constraint information. An example of the output document is shown in FIG. 13.

In the same way, the conversion execution unit 5 performs conversion for each subsequent "例" (example) attribute defined in the conversion rules. When conversion has been performed for the last "<構造例>" (structure example) tag, whose "例" attribute is "n" (S37, S38), the document with the existing structure is replaced by the document with the post-conversion structure to give the output document (S39).

As a result, the output document shown in FIG. 12 is generated. In the output document shown in FIG. 12, each text node that carried a class notation has been replaced with a text element. The replacement of text elements refers to the classification (attribute values) assigned by the conversion map generation unit 4.

As described above, according to the present embodiment, even when the format of input documents varies, the user does not need to create rules for each input document as in the conventional approach, and the cost of document conversion can be reduced.

In addition, the output form of a tabular portion read by OCR software differs depending on the OCR software; for example, the cells may be separated by commas, or the portion may be output as a single line. Even with such variations in the format of the input document, the document processing apparatus of the present embodiment can appropriately output a structured document such as XML.

Furthermore, the document processing apparatus of the present embodiment can semi-automatically convert a variety of existing text documents into structured documents (XML documents) based on an application standard, so that the information in tabular portions can easily be reused and searched.

As described above, according to the present embodiment, a predetermined tabular range can be appropriately identified in and extracted from an input document, and a desired output document can be created for the extracted range. Moreover, because fine-grained constraints such as the number of appearances, order, and position of tags with the same name are taken into account, complex conversions are possible.

The present invention is not limited to the embodiment described above, and various changes and modifications are possible without departing from the gist of the present invention.

FIG. 1 is a block diagram showing the configuration of a document structuring processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of an input document according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of an input document converted into text data according to the embodiment of the present invention.
FIG. 4 is a flowchart showing an example of the overall flow of the structuring process according to the embodiment of the present invention.
FIG. 5 is a flowchart showing an example of the processing flow of the attribute analysis unit according to the embodiment of the present invention.
FIG. 6 shows a description example of the conversion rules.
FIG. 7 is a diagram showing an example of a text element list.
FIG. 8 is a diagram for explaining an example of the information, defined in the knowledge dictionary unit, for extracting item names and labels.
FIG. 9 is a flowchart showing an example of the flow of the attribute assignment process.
FIG. 10 is a diagram showing a specific example of assigned attribute values and text IDs.
FIG. 11 is a flowchart showing an example of the processing flow of the conversion execution unit.
FIG. 12 is a diagram showing a specific example of the final output document structure for the conversion target location.
FIG. 13 is a diagram showing another specific example of the final output document structure for the conversion target location.

Explanation of symbols

1 ... document structuring processing apparatus, 2 ... conversion target location specifying unit, 3 ... attribute analysis unit, 4 ... conversion map generation unit, 5 ... conversion execution unit, 6 ... knowledge dictionary unit, 7 ... knowledge dictionary unit, 8 ... conversion rule analysis unit, 9 ... conversion rule unit, 10 ... morphological analysis unit.

Agent: Patent Attorney Susumu Ito

Claims

A knowledge dictionary part that defines attribute assignment information for assigning attribute information to a predetermined character string;
A conversion rule part that defines output constraint information for converting the conversion target part of the input document and outputting the output document;
An attribute assigning unit that determines whether the predetermined character string is present in a character string appearing in a specified conversion target location and, based on the attribute assignment information of the knowledge dictionary part, assigns the attribute information corresponding to the predetermined character string;
A conversion map generation unit that determines, from the attribute information assigned by the attribute assigning unit, the regularity of appearance of predetermined attribute information defined in the output constraint information, and generates a conversion map of the output constraint information based on the determined regularity of appearance;
A document processing apparatus, comprising: an output document generation unit configured to generate and output an output document in a format that matches the output constraint information with respect to the information of the conversion map.

The document processing apparatus according to claim 1, further comprising a conversion target location specifying unit that specifies the conversion target location of the input document, wherein the conversion target location is the location specified by the conversion target location specifying unit.

The conversion target portion includes a character string having a tabular structure,
The conversion map generation unit sets the number of repeated appearances of the predetermined attribute information as the number of repeated appearances of the cell contents in the table format, and according to the number of repeated appearances, the character string that appears in the conversion target location 3. The document processing apparatus according to claim 1, wherein the predetermined character string is determined to indicate the cell contents, and the conversion map is generated.

The document processing apparatus according to claim 3, wherein the output document generation unit does not output the output document including the cell content when the cell content is an empty element.

The document processing apparatus according to any one of claims 1 to 4, wherein the attribute assigning unit includes a morphological analysis unit that performs morphological analysis on the character strings appearing in the conversion target location, and assigns the attribute information using the analysis result of the morphological analysis unit.

A document processing method comprising:
defining attribute assignment information for assigning attribute information to a predetermined character string;
defining output constraint information for converting a conversion target location of an input document and outputting an output document;
determining whether the predetermined character string is present among the character elements appearing in the specified conversion target location, and assigning the attribute information corresponding to the predetermined character string based on the attribute assignment information;
determining, from the assigned attribute information, the regularity of appearance of predetermined attribute information defined in the output constraint information, and generating a conversion mapping of the output constraint information based on the determined regularity; and
generating and outputting an output document, from the generated conversion mapping information, in a format conforming to the output constraint information.

A document conversion program, being a computer program for document conversion, that causes a computer to realize:
a function of defining attribute assignment information for assigning attribute information to a predetermined character string;
a function of defining output constraint information for converting a conversion target location of an input document and outputting an output document;
a function of determining whether the predetermined character string is present among the character elements appearing in the specified conversion target location, and assigning the attribute information corresponding to the predetermined character string based on the attribute assignment information;
a function of determining, from the assigned attribute information, the regularity of appearance of predetermined attribute information defined in the output constraint information, and generating a conversion mapping of the output constraint information based on the determined regularity; and
a function of generating and outputting an output document, from the generated conversion mapping information, in a format conforming to the output constraint information.