JP2001195400A

JP2001195400A - Method and device for structuralizing document context

Info

Publication number: JP2001195400A
Application number: JP2000008187A
Authority: JP
Inventors: Kaori Inoue; 香織井上; Seiji Yokomichi; 誠司横路; Katsumi Takahashi; 克己高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-01-17
Filing date: 2000-01-17
Publication date: 2001-07-19

Abstract

PROBLEM TO BE SOLVED: To provide a document context structuring method and device that, when a semantic attribute appears multiple times during semantic structuring of a document, can divide the information in the document by the object being described and can describe the relationships between attributes so that the object each attribute belongs to can be identified. SOLUTION: Attributes are assigned to the character strings in an input unstructured document. When the same attribute name appears multiple times in the set of attributes, the unstructured document is judged to be a document written about multiple objects, and structuring is carried out in multiple contexts.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a document context structuring method and apparatus, and more particularly to a method and apparatus that structure a document based on the semantic contexts within it, so that information can be retrieved semantically with the context taken into account and semantically similar information can be integrated and classified.

[Prior Art] In document structuring, to represent the relationships between attributes, a structuring apparatus divides the information in a document into several regions according to predetermined rules, extracts the relational structure between the divided regions and the relational structure between the attributes within each region, and thereby represents the relationships between the attributes.

[0002] When a document is structured by its surface structure, the information in the document is divided, for example, at paragraph positions; the relational structure between paragraphs is expressed using paragraph scale (major, medium, and minor paragraphs), and the relationships between attributes within one paragraph are expressed by defining attribute relations.

[0003] When a document is structured by its logical structure, the information in the document is divided, for example, at the positions of conjunctions; the logical structure of the divided regions is expressed using the logical structure that the conjunctions indicate, and the relationships between attributes within one paragraph are expressed by defining attribute relations.

[0004]

[Problems to Be Solved by the Invention] However, in the conventional methods described above, whether the structure is logical or formal, a document has a single fixed logical or formal structure, and structuring under a given set of structuring rules yields only one result. Structuring based on the multiple semantic contexts of a document is therefore not possible.

[0005] The present invention has been made in view of the above. In semantic structuring of a document, each semantic attribute may appear multiple times. An object of the invention is to provide a document context structuring method and apparatus that, in order to identify the object to which each attribute belongs, can divide the information in the document by the object being described and describe the relationships between attributes.

[0006] Another object of the present invention is to provide a document context structuring method and apparatus that divide the information in a document in multiple contexts, thereby enabling semantic structuring that supports retrieval in various contexts.

[0007]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention.

[0008] The present invention (claim 1) is a document context structuring method that structures a document based on the semantic contexts within it, so that information can be retrieved semantically with the context taken into account and semantically similar information can be integrated and classified. In this method, attributes are assigned to character strings in an unstructured document (step 1); when the same attribute name appears multiple times in the set of attributes (step 2), the unstructured document is judged to be a document written about multiple objects, and structuring is performed in multiple contexts (step 3).

[0009] In the present invention (claim 2), when the same attribute name appears multiple times, an attribute name sequence is created by arranging the repeated attribute names in the order in which they appear in the document, one or more attribute patterns repeated within the attribute name sequence are extracted, the attribute name sequence is divided at each attribute pattern, and the structured information in the document is divided into multiple segments.

[0010] In the present invention (claim 3), attribute patterns are stored in a database in advance in association with context concepts (classes), the database is consulted based on the extracted attribute pattern, and a class is assigned to the document.

[0011] In the present invention (claim 4), classes and segments are assigned to every document, a class ID and a segment ID are assigned to every attribute in the document, and each attribute is output so that it is uniquely identified.

[0012] In the present invention (claim 5), when nested patterns are found in the attribute name sequence, the results of dividing into segments according to the multiple attribute patterns are held hierarchically, a document ID and a class ID are assigned to each segment, and the hierarchical relationship is output for each individual segment.

[0013] In the present invention (claim 6), the attributes assigned to the information in one document, the attribute name sequences created by arranging the attributes, and the attribute patterns extracted from the attribute name sequences are each one or more; one or more classes are assigned to one document; and one document is structured in one or more contexts.

[0014] FIG. 2 is a diagram showing the basic configuration of the present invention.

[0015] The present invention (claim 7) is a document context structuring apparatus that structures a document based on the semantic contexts within it, so that information can be retrieved semantically with the context taken into account and semantically similar information can be integrated and classified. The apparatus comprises document input means 11 for inputting an unstructured document, attribute assigning means 13 for assigning attributes to character strings in the unstructured document, and pattern dividing means 16 that, when the same attribute name appears multiple times in the set of attributes, judges the document to be a document written about multiple objects and performs structuring in multiple contexts.

[0016] The present invention (claim 8) further comprises pattern extracting means 14 that, when the same attribute name appears multiple times, creates an attribute name sequence by arranging the repeated attribute names in the order in which they appear in the document and extracts one or more attribute patterns repeated within the attribute name sequence. The pattern dividing means 16 includes means for dividing the attribute name sequence at each attribute pattern and dividing the structured information in the document into multiple segments.

[0017] In the present invention (claim 9), the pattern extracting means 14 includes means for storing attribute patterns in the database 12 in advance in association with context concepts (classes), and the pattern dividing means 16 includes means for consulting the database 12 based on the extracted attribute pattern and assigning a class to the document.

[0018] In the present invention (claim 10), the pattern dividing means 16 includes means for assigning classes and segments to every document and assigning a class ID and a segment ID to every attribute in the document, and the apparatus comprises output means 17 for outputting each attribute so that it is uniquely identified.

[0019] In the present invention (claim 11), the pattern dividing means 16 includes means for, when nested patterns are found in the attribute name sequence, holding hierarchically the results of dividing into segments according to the multiple attribute patterns and assigning a document ID and a class ID to each segment, and the output means 17 includes means for outputting the hierarchical relationship for each individual segment.

[0020] In the present invention (claim 12), the pattern dividing means 16 includes means for structuring one document in one or more contexts, where the attributes assigned to the information in one document, the attribute name sequences created by arranging the attributes, and the attribute patterns extracted from the attribute name sequences are each one or more, and one or more classes are assigned to one document.

[0021] As described above, the present invention enables semantic structuring when structuring a document. Here, structuring means assigning attributes to the information in a document and indicating the relationships between those attributes. The present invention indicates relationships between attributes that depend on the context. As a result, even though each semantic attribute may appear multiple times in semantic structuring of a document, the information in the document can be divided by the object being described and the relationships between attributes can be described, so that the object to which each attribute belongs can be identified.

[0022]

[Embodiments of the Invention] First, multiple classes are defined as concepts representing contexts. Each class has its corresponding semantic attributes, and the hierarchical relationships between classes are also defined. Next, semantic attributes are assigned to the information in the document. By assigning the semantic attributes corresponding to every possible class, semantic attributes appropriate to each context can be assigned. In other words, a given piece of information receives multiple semantic attributes.

[0023] When one semantic attribute name appears multiple times in one document, the document contains information about multiple objects. To identify each of these repeated attributes uniquely, the set of attributes is divided by object. Each divided part is called a segment, and the corresponding class is determined from the pattern of attributes in the segment. Because multiple semantic attributes are assigned to a given piece of information, there are multiple attribute patterns and therefore multiple classes. One document can thus be structured in multiple contexts.

[0024] Each divided part is given a class ID and a segment ID that identifies the part. In the structuring result, writing the class ID and the segment ID alongside an attribute identifies that attribute uniquely within the document.

[0025] When a segment itself contains information about multiple objects, it is divided into lower segments, and a lower class ID and a lower segment ID are assigned. This determines the hierarchical relationship between segments. Because every attribute carries a class ID and a segment ID, determining the hierarchical relationship between segments is equivalent to determining the hierarchical relationship between attributes.

[0026]

[Examples] Embodiments of the present invention are described below with reference to the drawings.

[0027] FIG. 3 shows the configuration of a document context structuring apparatus according to one embodiment of the present invention.

[0028] The document context structuring apparatus 1 shown in FIG. 3 comprises a document input unit 11, a document storage unit 12, an attribute assigning unit 13, a pattern extraction unit 14, a pattern dividing unit 16, an output unit 17, an attribute dictionary 22, pattern match rules 23, and a class database 24.

[0029] The document input unit 11 receives an unstructured document as input.

[0030] The document storage unit 12 stores the unstructured document. The unstructured document is transformed by each processing stage and stored again.

[0031] The attribute assigning unit 13 assigns attributes to character strings in the unstructured document by referring to the attribute dictionary 22 and the pattern match rules 23.

[0032] The attribute dictionary 22 is a dictionary for extracting character string attributes (names and the like).

[0033] The pattern match rules 23 are a set of rules for extracting pattern attributes (telephone numbers, times, and the like).

[0034] The pattern extraction unit 14 arranges the attributes assigned by the attribute assigning unit 13 in order of their positions of appearance to create an attribute name sequence, and extracts the attribute patterns that repeat.

[0035] The class database 24 stores, in advance, the correspondence between classes and attribute patterns (sequences of attributes). It also stores the hierarchical structure between the classes.

[0036] The pattern dividing unit 16 matches the attribute patterns obtained by the pattern extraction unit 14 against the attribute patterns in the class database 24, extracts the corresponding classes, and, based on the attribute patterns, divides the attribute-annotated information in the document into segments.

[0037] The output unit 17 outputs the attribute information with the class information and segment information extracted by the pattern dividing unit 16 added to it.

[0038] Next, the operation of the above configuration is described.

[0039] FIG. 4 is a flowchart showing the operation of the document context structuring apparatus according to one embodiment of the present invention.

[0040] Step 101) An unstructured document is input to the document input unit 11 and stored in the document storage unit 12.

[0041] Step 102) The attribute assigning unit 13 assigns semantic attributes to character strings in the document using the attribute dictionary 22 and the pattern match rules 23.

[0042] For example, by matching against the attribute dictionary 22, the attribute "shop name" is assigned to a character string of the form "XX-ya" (○○屋). By the pattern match rules 23, the attribute "telephone number" is assigned to a symbol string of the form "XXX-XXX-XXXX". A given character string or symbol string can have several semantic attributes; every possible semantic attribute is assigned.
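The attribute assignment of step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary entries, the regular expression, and the function name `assign_attributes` are all assumptions made for the example.

```python
import re

# Hypothetical attribute dictionary: surface string -> every candidate attribute name.
ATTRIBUTE_DICTIONARY = {
    "Maruya": ["shop name"],
    "carrot": ["agricultural product", "ingredient", "vegetable", "novel title"],
}

# Hypothetical pattern match rules: (attribute name, compiled regular expression).
PATTERN_RULES = [
    ("telephone number", re.compile(r"\d{3}-\d{3}-\d{4}")),
]

def assign_attributes(text):
    """Return (start offset, matched string, attribute name) triples, sorted by position."""
    found = []
    # Dictionary lookup: assign all candidate attributes to each dictionary hit.
    for word, names in ATTRIBUTE_DICTIONARY.items():
        for m in re.finditer(re.escape(word), text):
            for name in names:
                found.append((m.start(), m.group(), name))
    # Pattern rules: assign the rule's attribute to each regex hit.
    for name, rx in PATTERN_RULES:
        for m in rx.finditer(text):
            found.append((m.start(), m.group(), name))
    return sorted(found)
```

Sorting by offset gives the order of appearance that the later steps rely on when building the attribute name sequence.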

[0043] For example, "carrot" is given the attributes "agricultural product", "ingredient", "vegetable", "novel title", and so on. Co-occurrence rules and exclusion rules are applied between these attributes, which narrows down the set of assigned semantic attributes. For example, if a co-occurrence rule is defined between the value "agriculture" of the "industry" attribute and the "agricultural product" attribute, then when the "industry" attribute is "agriculture", the character string "carrot" is given the attribute "agricultural product". Similarly, if a co-occurrence rule is defined between the value "restaurant" of the "industry" attribute and the "ingredient" attribute, then when the "industry" attribute is "restaurant", "carrot" is given the attribute "ingredient".
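The narrowing by co-occurrence rules might look like the sketch below. The rule table, the `context` dictionary, and the fallback of keeping every candidate when no rule fires are assumptions for illustration; exclusion rules are not modeled.

```python
# Hypothetical co-occurrence rules: (context attribute, its value) -> attribute that is kept.
CO_OCCURRENCE_RULES = {
    ("industry", "agriculture"): "agricultural product",
    ("industry", "restaurant"): "ingredient",
}

def narrow_candidates(candidates, context):
    """Keep the candidates licensed by a co-occurrence rule; keep all if no rule applies."""
    licensed = {
        attr
        for (ctx_name, ctx_value), attr in CO_OCCURRENCE_RULES.items()
        if context.get(ctx_name) == ctx_value
    }
    kept = [c for c in candidates if c in licensed]
    return kept or list(candidates)
```

Calling `narrow_candidates` with the four "carrot" candidates and `{"industry": "restaurant"}` leaves only "ingredient", mirroring the example in the text.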

[0044] Step 103) Next, the pattern extraction unit 14 distinguishes two cases: (1) the same semantic attribute appears multiple times, and (2) each semantic attribute name appears only once.

[0045] Step 104) In case (1) of step 103, the pattern extraction unit 14 arranges the repeated attribute names in the order of their positions in the document to create an attribute name sequence. Because multiple attributes may be assigned to a given piece of information, there may also be multiple attribute sequence patterns. For each attribute sequence pattern, the repeated attribute patterns are extracted.

[0046] An example of the extraction method is described below.

[0047] Sequences of attributes that appear two or more times in the attribute name sequence are extracted. A sequence must not contain the same attribute twice. The attributes in a sequence need not be adjacent: between two attributes, zero or more unspecified attributes that differ from both may appear. For example, the case where zero or more unspecified attributes other than A and B appear between attribute A and attribute B is written "A*B". These sequences of attributes are the attribute patterns. For example, if the attribute name sequence is "ABCABCABCB" (assuming each letter is one attribute name), the adjacent duplicate-free patterns of two or more attributes are "AB", "BC", "CA", "ABC", "BCA", and "CAB", and the non-adjacent duplicate-free patterns are "A*C", "B*A", and "C*B". Because * means that zero or more unspecified attributes appear in between, and "A*C" and "B*A" only ever have one specific string in between, they are absorbed into "ABC" and "BCA". The attribute pattern candidates are therefore "AB", "BC", "CA", "ABC", "BCA", "CAB", and "C*B". These candidates are matched against the attribute name sequence from the beginning, and the pattern that matches most often and contains the most attributes is taken as the attribute pattern. In this example, "ABC" is the attribute pattern. The extracted pattern is not necessarily unique; several candidates may be extracted.
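The selection rule in this example can be sketched in a simplified form. The sketch below considers only adjacent (contiguous) patterns and does not model the "A*B" wildcard patterns; the function names `count_matches` and `extract_pattern` are hypothetical.

```python
def count_matches(seq, pat):
    """Count non-overlapping occurrences of pat in seq, scanning from the front."""
    count, i, n = 0, 0, len(pat)
    while i + n <= len(seq):
        if tuple(seq[i:i + n]) == pat:
            count += 1
            i += n
        else:
            i += 1
    return count

def extract_pattern(seq):
    """Pick the contiguous, duplicate-free pattern that matches most often,
    preferring the one with more attributes; return None if nothing repeats."""
    candidates = set()
    for i in range(len(seq)):
        for j in range(i + 2, len(seq) + 1):
            pat = tuple(seq[i:j])
            if len(set(pat)) < len(pat):
                break  # an attribute repeats inside the window; longer windows would too
            candidates.add(pat)
    scored = [(count_matches(seq, p), len(p), p) for p in candidates]
    scored = [s for s in scored if s[0] >= 2]  # a pattern must appear at least twice
    if not scored:
        return None
    return max(scored)[2]
```

For the sequence "ABCABCABCB", the patterns "AB", "BC", and "ABC" each match three times, and the length tie-break selects "ABC", in line with the example in the text.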

[0048] Step 105) When multiple candidates are extracted as attribute patterns, the pattern dividing unit 16 determines the relationships between the attribute patterns.

[0049] For example, for the sequence "ABCDCDCDABABCD", a hierarchical structure is determined between the attribute patterns "AB" and "CD", with the lower pattern treated as nested inside the upper pattern. The hierarchy is determined as follows: since the attribute patterns "AB" and "CD" are each associated with a class, the class database 24 is consulted for the hierarchical relationship between the classes, and that relationship is adopted as the hierarchical relationship between the attribute patterns.
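A minimal sketch of deriving the pattern hierarchy from the class hierarchy follows. The contents of `CLASS_DATABASE`, the class names, and the choice of which class is the parent are assumptions for illustration; the text does not state which of "AB" and "CD" is the upper pattern.

```python
# Hypothetical class database: attribute pattern -> (class ID, parent class ID or None).
CLASS_DATABASE = {
    ("A", "B"): ("region", None),
    ("C", "D"): ("shop", "region"),
}

def pattern_hierarchy(patterns):
    """Return (upper pattern, lower pattern) pairs implied by the class hierarchy."""
    # Map each known class ID back to the pattern that represents it.
    by_class = {CLASS_DATABASE[p][0]: p for p in patterns if p in CLASS_DATABASE}
    pairs = []
    for p in patterns:
        if p not in CLASS_DATABASE:
            continue
        parent_class = CLASS_DATABASE[p][1]
        if parent_class in by_class:
            pairs.append((by_class[parent_class], p))
    return pairs
```

With the assumed table, `pattern_hierarchy([("A", "B"), ("C", "D")])` reports "AB" as the upper pattern and "CD" as the pattern nested inside it.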

[0050] Step 106) The pattern dividing unit 16 matches the extracted attribute patterns against the attribute patterns in the class database 24.

[0051] Step 107) The class that corresponds as a result of the matching is extracted.

Step 108) The pattern dividing unit 16 delimits the attribute name sequence at the positions where the attribute pattern appears and assigns a segment ID to each delimited part. A class ID and a segment ID are assigned to every attribute.
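Step 108 can be sketched as follows, assuming a contiguous pattern and assuming that attributes following a pattern occurrence stay in that occurrence's segment. The function name `label_segments` and the tuple layout are hypothetical.

```python
def label_segments(seq, pattern, class_id):
    """Return (attribute, class ID, segment ID) triples; each occurrence of the
    pattern opens a new segment, and trailing attributes join the open segment."""
    pattern = tuple(pattern)
    n = len(pattern)
    labeled, segment_id, i = [], 0, 0
    while i < len(seq):
        if tuple(seq[i:i + n]) == pattern:
            segment_id += 1  # a new occurrence of the pattern starts a new segment
            labeled.extend((a, class_id, segment_id) for a in seq[i:i + n])
            i += n
        else:
            labeled.append((seq[i], class_id, segment_id))
            i += 1
    return labeled
```

Applied to "ABCABCABCB" with the pattern "ABC", this yields three segments, with the final "B" attached to the third. Attributes that precede the first occurrence would receive segment ID 0 in this sketch.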

[0052] Step 109) The segments produced by the division are treated as nested lower segments, hierarchical relationships between segments are assigned, and processing returns to step 104. If the same semantic attribute appears multiple times within a segment, the processing of steps 104, 105, and 106 is performed again.

[0053] Step 110) Even when no attribute appears more than once in step 103, the pattern dividing unit 16 matches the set of attribute names that appear against the class database 24 and extracts the corresponding class. In this case there is one segment per class.

[0054] Step 111) The output unit 17 assigns a class ID and a segment ID to all attribute information and outputs it. For example, as in FIGS. 5(A) and 5(B), the result is output as an XML document for each class or stored in a database. In FIG. 5, (A) and (B) have the same document ID; that is, they derive from the same document, but the structuring results differ because the class information representing the context differs. Since the structuring result changes with the class information, different output is generated, as in (A) and (B).
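A per-class XML output along these lines could be produced with Python's standard `xml.etree.ElementTree`. The element and attribute names (`document`, `segment`, `attribute`) and the sample values are assumptions, since the actual layout of FIG. 5 is not reproduced in this text.

```python
import xml.etree.ElementTree as ET

def to_xml(document_id, class_id, labeled):
    """Serialize one class's structuring result.
    labeled: iterable of (attribute name, value, segment ID) triples."""
    root = ET.Element("document", {"id": document_id, "class": class_id})
    segments = {}
    for name, value, segment_id in labeled:
        seg = segments.get(segment_id)
        if seg is None:
            # First attribute of this segment: create the segment element.
            seg = ET.SubElement(root, "segment", {"id": str(segment_id)})
            segments[segment_id] = seg
        ET.SubElement(seg, "attribute", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `to_xml` once per class with the same document ID would give several differently structured outputs for one source document, which is the situation the text describes for (A) and (B).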

[0055] The output unit 17 also outputs the hierarchical relationships held in steps 105 and 108 for each region identified by a class ID and a segment ID. As shown in FIG. 6, the output is stored in a database table. In this table, the child unique segment ID and the parent unique segment ID are stored for each segment ID.

[0056] Because steps 104 through 111 are performed for every attribute name sequence, the result of structuring one document under multiple classes can be output.

[0057] The above embodiment was described using the configuration of FIG. 3 and the flowchart of FIG. 4. The operation of FIG. 4 can also be implemented as a program and stored on a disk device connected to the computer used as the document context structuring apparatus, or on a portable storage medium such as a floppy disk or CD-ROM, and installed when the invention is put into practice; the invention can thereby be realized easily.

[0058] The present invention is not limited to the above embodiment and can be modified and applied in various ways within the scope of the claims.

[0059]

[Effects of the Invention] As described above, although each semantic attribute may appear multiple times in semantic structuring of a document, the present invention can divide the information in the document by the object being described and describe the relationships between attributes, so that the object to which each attribute belongs can be identified.

[0060] In addition, by dividing the information in a document in multiple contexts, the present invention makes retrieval in various contexts possible.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the basic configuration of the present invention.

FIG. 3 is a configuration diagram of a document context structuring apparatus according to one embodiment of the present invention.

FIG. 4 is a flowchart showing the operation of the document context structuring apparatus according to one embodiment of the present invention.

FIG. 5 is an example of output as a structured document using XML according to one embodiment of the present invention.

FIG. 6 is a diagram showing the relationship between the class "region" and the class "shop" according to one embodiment of the present invention.

[Explanation of symbols]

1: document context structuring apparatus
11: document input means, document input unit
12: database, document storage unit
13: attribute assigning means, attribute assigning unit
14: pattern extracting means, pattern extraction unit
16: pattern dividing means, pattern dividing unit
17: output means, output unit
22: attribute dictionary
23: pattern match rules
24: class database

Continuation of front page: (72) Inventor: Katsumi Takahashi, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo. F-term (reference): 5B009 QA09 SA14 VA09

Claims

[Claims]

1. A document context structuring method that structures a document based on the semantic contexts within it, so that information can be retrieved semantically with the context taken into account and semantically similar information can be integrated and classified, the method characterized by: assigning attributes to character strings in an input unstructured document; and, when the same attribute name appears multiple times in the set of attributes, judging the unstructured document to be a document written about multiple objects and structuring it in multiple contexts.

2. The document context structuring method wherein, when the same attribute name appears multiple times, an attribute name sequence is created by arranging the repeated attribute names in the order in which they appear in the document, one or more attribute patterns repeated within the attribute name sequence are extracted, the attribute name sequence is divided at each attribute pattern, and the structured information in the document is divided into multiple segments.

3. The document context structuring method, wherein attribute patterns are stored in a database in advance in association with a concept (class) of a context, and the database is referred to based on the extracted attribute pattern to assign a class to the document.

4. The document context structuring method according to claim [...] or 3, wherein a class and a segment are assigned to every document, a class ID and a segment ID are assigned to every attribute in the document, and each attribute is uniquely identified and output.

5. The document context structuring method according to claim 2, wherein, when nested patterns are found in the attribute name string, the result of segmentation is retained hierarchically according to the plurality of attribute patterns, a document ID and a class ID are assigned to each segment, and the hierarchical relationship is output for each individual segment.

6. The document context structuring method according to claim 3, wherein, based on the attributes assigned to information in one document, the attribute name string created by arranging the attributes, and the one or more attribute patterns extracted from the attribute name string, one or more classes are assigned to the one document, and the one document is structured in one or more contexts.

7. A document context structuring device for performing structuring based on the semantic context in a document so as to enable context-aware semantic search of information and the integration and classification of semantically similar information, the device comprising: document input means for inputting an unstructured document; attribute assigning means for assigning attributes to character strings in the unstructured document; and pattern dividing means for determining, when the same attribute name appears a plurality of times in the set of assigned attributes, that the unstructured document is a document written about a plurality of objects, and for structuring it in a plurality of contexts.

8. The document context structuring device according to claim 7, further comprising pattern extracting means for, when the same attribute name appears a plurality of times, creating an attribute name string in which the attribute names are arranged in the order in which they appear in the document, and extracting at least one attribute pattern that is repeated in the attribute name string, wherein the pattern dividing means has means for dividing the attribute name string at each attribute pattern and dividing the structured information in the document into a plurality of segments.

9. The document context structuring device according to claim 8, wherein the pattern extracting means has means for storing attribute patterns in a database in advance in association with a concept (class) of a context, and the pattern dividing means has means for referring to the database based on the extracted attribute pattern and assigning a class to the document.

10. The document context structuring device according to claim 7, wherein the pattern dividing means has means for assigning a class and a segment to every document and assigning a class ID and a segment ID to every attribute in the document, the device further comprising output means for uniquely identifying and outputting each attribute.

11. The document context structuring device according to claim 8, wherein the pattern dividing means has means for, when nested patterns are found in the attribute name string, hierarchically retaining the result of segmentation according to the plurality of attribute patterns and assigning a document ID and a class ID to each segment, and the output means has means for outputting the hierarchical relationship for each individual segment.

12. The document context structuring device according to claim 7, wherein the pattern dividing means has means for assigning one or more classes to one document, and structuring the one document in one or more contexts, based on the attributes assigned to information in the document, the attribute name string created by arranging the attributes, and the one or more attribute patterns extracted from the attribute name string.
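
As a non-authoritative illustration, the flow recited in the method claims (build an attribute name string, extract a repeated attribute pattern, divide into segments, look up a class) can be sketched in Python. The sketch assumes that attributes have already been assigned as (attribute name, character string) pairs and that a repeated attribute pattern tiles the whole attribute name string exactly, which is a simplification. The function names, record fields, sample data, and class database contents are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Optional, Tuple

Attr = Tuple[str, str]  # (attribute name, character string); hypothetical layout


def find_repeated_pattern(names: List[str]) -> Optional[List[str]]:
    """Return the shortest pattern that tiles `names` at least twice, else None.

    Simplifying assumption: the pattern must cover the whole attribute
    name string exactly, with no missing or extra attributes.
    """
    n = len(names)
    for size in range(1, n // 2 + 1):
        if n % size == 0 and names == names[:size] * (n // size):
            return names[:size]
    return None


def structure(doc_id: str,
              attrs: List[Attr],
              class_db: Dict[Tuple[str, ...], str]) -> List[dict]:
    """Divide the attributes into segments and tag each with class and segment IDs."""
    # Attribute name string: names in the order they appear in the document.
    names = [name for name, _ in attrs]
    pattern = find_repeated_pattern(names)
    if pattern is None:
        # No repetition found: treat the document as a single context.
        pattern = names
    size = len(pattern) or 1
    # Class lookup keyed by attribute pattern (hypothetical database shape).
    class_id = class_db.get(tuple(pattern))
    return [
        {
            "doc_id": doc_id,
            "class_id": class_id,
            "segment_id": i // size + 1,  # one segment per pattern occurrence
            "attribute": name,
            "text": text,
        }
        for i, (name, text) in enumerate(attrs)
    ]


if __name__ == "__main__":
    # Hypothetical sample: one document describing two shops.
    sample = [
        ("shop", "Cafe A"), ("phone", "00-0000-0001"),
        ("shop", "Bar B"), ("phone", "00-0000-0002"),
    ]
    for record in structure("doc1", sample, {("shop", "phone"): "store_info"}):
        print(record)
```

Because each output record carries a document ID, a class ID, and a segment ID, an attribute value can be traced back to the object it describes even when the same attribute name occurs several times. Nested patterns and inexact repetitions would need a more elaborate extraction step than this exact-tiling check.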