JP3610679B2

JP3610679B2 - Document structure conversion apparatus and document structure conversion method

Info

Publication number: JP3610679B2
Application number: JP16563596A
Authority: JP
Inventors: 仁樹京嶋; 和也千葉
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1996-06-26
Filing date: 1996-06-26
Publication date: 2005-01-19
Anticipated expiration: 2016-06-26
Also published as: JPH1011441A

Description

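[0001]
[Technical Field of the Invention]
The present invention relates to a document structure conversion apparatus and a document structure conversion method for converting the document structure of a structured document, and in particular to a document structure conversion apparatus that converts a document structure conforming to a first document class into a document structure conforming to a second document class.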
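[0002]
[Prior Art]
A document that has a logical structure made up of constituent elements such as chapters, sections, and subsections is called a structured document. It is known that unifying the structure of such documents makes them easier to share and convert. With the spread of international standards such as SGML (Standard Generalized Markup Language) and ODA (Office Document Architecture), structured documents are becoming the mainstream form of electronic document.
[0003]
A structured document is normally structured according to a classification called a document class, which defines the structure and constituent elements of the document. In ODA the common logical structure corresponds to the document class, and in SGML the DTD (Document Type Definition) plays that role. In the description that follows, "document" always means a structured document.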
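[0004]
It matters that the structure of a document obeys the constraints of its document class. For example, rules for laying out multiple documents are often written on the assumption that the structure of each document and its constituent elements satisfy the constraints of a particular document class. If the structure of a document departs from those constraints, the document cannot be output with the correct layout.
[0005]
Likewise, many programs that process large numbers of documents, such as one that builds a list of abstracts from a set of reports, rely on the target documents having been created according to a particular document class. When such a program is used, a document that does not follow the expected document class can prevent the program from running correctly.
[0006]
In addition, databases that hold large numbers of documents often use the document class as the schema. Data that departs from the schema therefore seriously undermines the reliability of the database.
[0007]
In other words, a program written for documents whose structure conforms to document class A is difficult to apply to documents whose structure conforms to a different document class B, however similar the two structures may be. In such cases the document structure conforming to document class B has to be converted into one conforming to document class A.
[0008]
Several techniques exist for converting one document structure into another to address this problem. The description below assumes that every document structure can be represented as a tree.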
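[0009]
First, a document structure can be converted by applying the concept of the "fallback class" described in Fred Cole and Heather Brown, "Editing Structured Documents - Problems and Solutions" (Electronic Publishing, vol. 5, no. 4, pp. 209-216). Rules are defined in advance that state, for each node type specified by document class B, which node type specified by document class A it is converted to. When the conversion is executed, each node in the document structure conforming to document class B is converted, one by one, into a node of the document structure conforming to document class A according to these rules.
[0010]
Japanese Laid-Open Patent Publication No. 7-28817 discloses a method in which the user defines a one-to-one correspondence between the types that make up document class B and the types that make up document class A. The document structure is converted by using the defined correspondence.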
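[0011]
[Problems to Be Solved by the Invention]
However, when document class A and document class B differ in the granularity of the content held by the nodes that make up their document structures, these methods cannot produce the desired document structure.
[0012]
For example, suppose that in document class A the author information is the content of a single "author" node, whereas in document class B the author's family name is placed in the content of a "family name" node and the author's given name in the content of a "given name" node.
[0013]
In this case, neither of the above methods can convert a document structure conforming to document class A into one conforming to document class B. These methods cannot split a node in the document structure according to its content and convert the pieces into the contents of separate nodes.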
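[0014]
Methods do exist for dividing an unstructured document according to its content, converting the pieces into separate nodes and their contents, and so creating a document structure that conforms to a desired document class.
[0015]
Japanese Laid-Open Patent Publication No. 62-249270 discloses a document processing apparatus that holds words used as headings in a dictionary, uses the dictionary to identify the headings of a document entered by the user, and displays the document in structured form.
[0016]
Japanese Laid-Open Patent Publication No. 4-4467 discloses a document structure analysis apparatus that registers correspondences between character string patterns and document structures and analyzes the document structure by matching against the input document data.
[0017]
However, none of these techniques can be used to normalize a document structure. That is, they cannot convert the document structure of a document that is already structured according to some document class into a document structure conforming to another document class.
[0018]
The present invention has been made in view of these points. Its object is to provide a document structure conversion apparatus that converts a document structure into another document structure and, in doing so, can split the content of the constituent nodes of the original document structure as required.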
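[0019]
[Means for Solving the Problems]
To solve the above problems, the present invention provides a document structure conversion apparatus for converting a document structure conforming to a first document class into a document structure conforming to a second document class, comprising: content division rule management means for managing a content division rule consisting of an application condition, which indicates the condition for a node to be split in a document structure conforming to the first document class, and a division pattern, which is specified as a pattern of delimiter character strings indicating where the content of the node to be split is divided; structure conversion rule management means for managing a structure conversion rule that defines which type of node in a document structure conforming to the second document class each node in a document structure conforming to the first document class, and each node generated by splitting, is converted to; division rule application element determination means for determining, when a first document structure conforming to the first document class is input, a division target node in the first document structure that satisfies the application condition; content division means for dividing the content of the division target node at positions that match the division pattern, connecting a plurality of fragment nodes, each holding one piece of the divided content, at predetermined positions in the first document structure, and thereby creating a subdivided document structure; and structure conversion means for converting the subdivided document structure created by the content division means according to the structure conversion rule to create a second document structure.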
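[0020]
With this document structure conversion apparatus, when a first document structure conforming to the first document class is input, the division rule application element determination means determines the division target node in the first document structure that satisfies the application condition. The content division means then divides the content of the division target node at positions that match the division pattern, which is specified as a pattern of delimiter character strings, and connects a plurality of fragment nodes, each holding one piece of the divided content, at predetermined positions in the first document structure to create a subdivided document structure. The structure conversion means then converts this subdivided document structure according to the structure conversion rule to create a second document structure. The second document structure created in this way conforms to the second document class, whose structure constraints are more finely divided than those of the first document class.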
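[0021]
The present invention also provides a document structure conversion method for converting a document structure conforming to a first document class into a document structure conforming to a second document class, in which: division rule application element determination means determines, in a first document structure conforming to the first document class, a division target node that satisfies an application condition indicating the condition for a node to be split in a document structure conforming to the first document class; content division means divides the content of the division target node at positions that match a division pattern, which is specified as a pattern of delimiter character strings indicating where the content of the node to be split is divided, and generates a subdivided document structure by connecting a plurality of fragment nodes, each holding one piece of the divided content, at predetermined positions in the first document structure; and the subdivided document structure is converted to create a second document structure according to a structure conversion rule that defines which type of node in a document structure conforming to the second document class each node in a document structure conforming to the first document class, and each node generated by splitting, is converted to.
[0022]
With this document structure conversion method, the content of the division target node of the first document structure is divided at positions that match the division pattern, which is specified as a pattern of delimiter character strings. A subdivided document structure is then generated in which a plurality of fragment nodes, each holding one piece of the divided content, are connected at predetermined positions in the first document structure. This subdivided document structure is further converted according to the structure conversion rule to create a second document structure. In this way a second document structure is created that conforms to the second document class, whose structure constraints are more finely divided than those of the first document class.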
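[0023]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram showing the principle of the present invention.
[0024]
The present invention converts a first document structure 1 conforming to a first document class into a second document structure 2 conforming to a second document class.
The content division rule management means 3 manages a content division rule consisting of an application condition, which indicates the condition for a node in the first document structure to become a division target, and a division pattern, which indicates where the content of such a node is divided. The structure conversion rule management means 6 manages a structure conversion rule that defines which type of node in the second document structure each node in the first document structure, and each node generated by splitting, is converted to.
[0025]
When the first document structure is input, the division rule application element determination means 4 determines, according to the content division rule, the division target node in the first document structure that satisfies the application condition.
[0026]
The content division means 5 divides the content of the division target node according to the content division rule. It then creates a subdivided document structure by connecting a plurality of fragment nodes holding the divided content to the first document structure.
[0027]
The structure conversion means 7 converts the subdivided document structure into the second document structure according to the structure conversion rule.
FIG. 2 is a block diagram of the document structure conversion apparatus of the present invention.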
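[0028]
An input/output UI 10 is connected to the document structure conversion apparatus 20. It accepts input from the user and displays output.
The document structure conversion apparatus 20 comprises a document parser 21 that extracts the document structure of input document data, a document class management unit 22 that manages the definition information of document classes, a conversion rule management unit 23 that manages the rules used for document conversion, a division processing unit 24 that divides the content of the input document structure, a structure conversion unit 25 that converts the document structure, and a document generator 26 that outputs the conversion result as document data.
[0029]
As input data, the user enters, through the input/output UI 10, the document data whose structure is to be converted (the input document data), the name of the document class that defines the document structure of the input document data (the input document class name), and the name of the document class that defines the desired output document structure (the output document class name).
[0030]
On receiving the input data, the document parser 21 first passes the input document class name to the document class management unit 22 and requests its definition information. When the document class management unit 22 supplies the definition information of the input document class, the parser analyzes the input document data on the basis of that definition information and obtains its document structure (the input document structure). It then passes the input and output document class names to the conversion rule management unit 23 and the input document structure to the division processing unit 24.
[0031]
The document class management unit 22 manages the definition information of every document class that the document structure conversion apparatus 20 can handle. When the document parser 21 passes it an input document class name, it returns the definition information of that document class.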
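[0032]
The conversion rule management unit 23 manages the conversion rules used for document structure conversion. Here, "conversion rules" means both the content division rules, which are used when splitting the content of a node, and the structure conversion rules, which are used when converting the type of a node. The conversion rule management unit 23 includes a content division rule management unit 23a, which manages the content division rules, and a structure conversion rule management unit 23b, which manages the structure conversion rules.
[0033]
The content division rule management unit 23a manages each content division rule as a set together with an input document class name and an output document class name. The structure conversion rule management unit 23b likewise manages each structure conversion rule as a set together with an input document class name and an output document class name.
[0034]
When an input document class name and an output document class name are passed to the conversion rule management unit 23, the content division rule management unit 23a searches for the content division rule associated with that pair of class names. If a matching content division rule is found, it is supplied to the division rule application element determination unit 24a. The structure conversion rule management unit 23b likewise searches for the structure conversion rule associated with the pair of class names and supplies it to the structure conversion unit 25.
[0035]
The division processing unit 24 has a division rule application element determination unit 24a, which determines which element of the supplied content division rule applies, and a content division unit 24b.
The input document structure is passed to the division processing unit 24 from the document parser 21. When the conversion rule management unit 23 supplies a content division rule for the input data, the division processing unit 24 works through every node of the input document structure in turn, determining the applicable rule element and splitting the content, connects the results to the input document structure to create a subdivided document structure, and passes it to the structure conversion unit 25. If the conversion rule management unit 23 supplies no content division rule, the input document structure is passed to the structure conversion unit 25 unchanged.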
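[0036]
In this process, the division rule application element determination unit 24a determines, for each node in the input document structure, which rule element of the content division rule supplied by the content division rule management unit 23a applies. The content division unit 24b breaks up the character string content of the node according to the division pattern in that rule element and creates a plurality of fragment nodes. The fragment nodes are then connected to the input document structure to create the subdivided document structure.
[0037]
The structure conversion unit 25 receives either the subdivided document structure or the input document structure from the division processing unit 24, and the structure conversion rule from the conversion rule management unit 23. It converts the received document structure according to the structure conversion rule, and the result, a document structure conforming to the output document class (the output document structure), is passed to the document generator 26.
[0038]
The document generator 26 receives the output document structure from the structure conversion unit 25, converts it into document data, and provides it to the user through the input/output UI 10.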
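[0039]
The procedure by which the document structure conversion apparatus converts a document structure is described next. FIG. 3 is a flowchart of the procedure. The description follows the steps in the figure.
[S1] Input data is accepted through the input/output UI 10. The input data consists of the input document class name, the output document class name, and the input document data.
[S2] The document parser 21 passes the input document class name and the output document class name to the conversion rule management unit 23. The conversion rule management unit 23 retrieves any conversion rules it manages for that pair of document classes. That is, if the content division rule management unit 23a manages a matching content division rule, it supplies the rule to the division rule application element determination unit 24a, and if the structure conversion rule management unit 23b manages a matching structure conversion rule, it supplies the rule to the structure conversion unit 25.
[S3] The document parser 21 determines whether the conversion rule management unit 23 was able to retrieve a structure conversion rule. If so, processing proceeds to step S4. If not, an error message is output through the input/output UI 10 and processing ends. When no structure conversion rule could be retrieved, the user should check that the input and output document class names are correct and that a structure conversion rule for converting between the two document classes is properly registered.
[S4] The document parser 21 passes the input document class name to the document class management unit 22 and obtains the definition information of the input document class. It analyzes the input document data on the basis of that definition information and passes the result to the division processing unit 24 as the input document structure.
[S5] The division processing unit 24 determines whether the conversion rule management unit 23 supplied a content division rule. If so, processing proceeds to step S6. If not, the input document structure is passed to the structure conversion unit 25 and processing proceeds to step S7.
[S6] The division processing unit 24 splits the content of the input document structure, creates a subdivided document structure, and passes it to the structure conversion unit 25. The procedure for creating the subdivided document structure is described in detail later.
[S7] The structure conversion unit 25 converts the subdivided document structure or input document structure it received according to the supplied structure conversion rule and creates a document structure conforming to the output document class (the output document structure). It passes the output document structure to the document generator 26. The structure conversion procedure is described in detail later.
[S8] The document generator 26 turns the output document structure into document data and provides it to the user through the input/output UI 10.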
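The flow of steps S1 to S8 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation disclosed in the patent: the function name `convert_document`, the dictionaries keyed by a pair of class names, and the callables `parse`, `split`, `convert`, and `generate` are hypothetical stand-ins for the document parser 21, the division processing unit 24, the structure conversion unit 25, and the document generator 26.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical key type: (input class name, output class name).
ClassPair = Tuple[str, str]


def convert_document(
    input_class: str,
    output_class: str,
    document_data: str,
    division_rules: Dict[ClassPair, Any],
    conversion_rules: Dict[ClassPair, Any],
    parse: Callable[[str, str], Any],
    split: Callable[[Any, Any], Any],
    convert: Callable[[Any, Any], Any],
    generate: Callable[[Any], str],
) -> str:
    key = (input_class, output_class)
    # S2: look up both kinds of rule by the pair of class names.
    division_rule = division_rules.get(key)
    conversion_rule = conversion_rules.get(key)
    # S3: a missing structure conversion rule is an error.
    if conversion_rule is None:
        raise LookupError(f"no structure conversion rule for {key}")
    # S4: parse the input data using the input class definition.
    structure = parse(input_class, document_data)
    # S5/S6: split node content only when a content division rule exists.
    if division_rule is not None:
        structure = split(structure, division_rule)
    # S7: convert node types; S8: turn the result into document data.
    return generate(convert(structure, conversion_rule))
```

Note the asymmetry the steps describe: a missing content division rule simply skips the splitting stage, whereas a missing structure conversion rule stops processing.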
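[0040]
Each part of the document structure conversion apparatus is now described in more detail.
The document parser 21 accepts the user's input data through the input/output UI 10, passes the input and output document class names to the conversion rule management unit 23, and determines whether the conversion rule management unit 23 manages a structure conversion rule that matches the input data. If it does, the parser extracts the input document structure from the input document data and passes it to the division processing unit 24. If it does not, the parser outputs an error message.
[0041]
The document class management unit 22 manages document classes. It stores each document class as a pair consisting of the name of the class and its definition information. Several document classes are stored in the document class management unit 22, and the name of each is unique within the document structure conversion apparatus 20.
[0042]
The definition information of a document class consists of type definitions for the nodes that make up a document structure and structure constraints that specify how the defined nodes connect. A document structure that satisfies the definition information of a document class is called a document structure of that class, and document data with such a structure is called document data of that class.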
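[0043]
The type definition of an element (node) of a document structure has two parts: a type name, which is a character string identifying the type of the node, and a content type specification, which indicates the kind of content the node holds. The content type specification is one of three values: "no content", "character string content", or "geometric figure content". In this embodiment, a node that has a substructure never has content.
[0044]
A type whose type name is "A" is called "type A", and a node of type A is called an "A node".
The structure constraints that specify how the defined nodes connect are defined by a tree built from the structure constraint operators described below and the node type definitions above. There are three structure constraint operators, SEQ, REP, and OPT, with the following meanings.
[0045]
SEQ takes several substructures and indicates that the structures they specify appear in the specified order.
REP takes a single substructure and indicates that the structure it specifies appears one or more times.
[0046]
OPT takes a single substructure and indicates that the structure it specifies appears either once or not at all.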
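The three operators read as "sequence", "one or more", and "optional". The following Python sketch shows one way to check a list of child type names against such a constraint. It is an assumption-laden illustration: the names `Seq`, `Rep`, `Opt`, `ends`, and `conforms` are hypothetical, and the sketch checks a single level of children rather than a whole document tree.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Constraint", ...]  # substructures that must appear in this order


@dataclass(frozen=True)
class Rep:
    part: "Constraint"  # substructure that appears one or more times


@dataclass(frozen=True)
class Opt:
    part: "Constraint"  # substructure that appears once or not at all


# A plain string stands for a node type name.
Constraint = Union[str, Seq, Rep, Opt]


def ends(c: Constraint, types: List[str], start: int) -> Set[int]:
    """Return every index where a match of c that begins at `start` can end."""
    if isinstance(c, str):
        return {start + 1} if start < len(types) and types[start] == c else set()
    if isinstance(c, Seq):
        positions = {start}
        for part in c.parts:
            positions = {e for p in positions for e in ends(part, types, p)}
        return positions
    if isinstance(c, Opt):
        return {start} | ends(c.part, types, start)
    # Rep: keep extending until no new end position is reached.
    reached: Set[int] = set()
    frontier = ends(c.part, types, start)
    while frontier - reached:
        reached |= frontier
        frontier = {e for p in frontier for e in ends(c.part, types, p)}
    return reached


def conforms(c: Constraint, types: List[str]) -> bool:
    """True when the whole list of child types satisfies the constraint."""
    return len(types) in ends(c, types, 0)
```

For example, `conforms(Seq(("title", "author", Rep("section"))), ["title", "author", "section"])` holds, while the same constraint rejects a list with no "section" entry because REP requires at least one occurrence.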
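The content division rule management unit 23a, which is part of the conversion rule management unit 23, manages content division rules. A content division rule is a list of content division rule elements stored as a set together with an input document class name and an output document class name.
[0047]
A content division rule element has three components: an input node condition, a division pattern, and a head deletion flag.
The input node condition states the condition an input node must satisfy for the rule element to apply. It can include conditions on the type of the input node and on its parent and sibling nodes.
[0048]
The division pattern specifies the pattern of the delimiter character strings at which the character string content of an input node matching the input node condition is divided. Dividing the content produces a plurality of fragment nodes.
[0049]
The head deletion flag indicates whether the part that matched the division pattern is deleted from the fragment nodes that are created. It is "T" to delete and "F" to keep.
[0050]
The role of the head deletion flag is as follows. Character strings used as division patterns include ones that mark the start or order of a passage, such as "(1)" and "●", and ones that serve as the heading of a passage, such as "[Logical structure] A logical structure is ..." and "Document class: A document class is ...". Because many layout programs for structured documents have a numbering function, markers such as "(1)" and "●" are sometimes better deleted.
[0051]
Providing a head deletion flag in the content division rule element makes it possible to handle the character strings that match the division pattern according to what they are. In other words, the splitting of node content can be configured in fine detail.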
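[0052]
The structure conversion rule management unit 23b, which is part of the conversion rule management unit 23, manages structure conversion rules. A structure conversion rule is a list of structure conversion rule elements stored as a set together with an input document class name and an output document class name.
[0053]
A structure conversion rule element has two components: an input node condition and an output node.
The input node condition states the condition an input node must satisfy for the rule element to apply. It can include conditions on the type of the input node and on its parent and sibling nodes.
[0054]
The output node states which type of node an input node matching the input node condition becomes in the output document structure.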
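The two kinds of rule element can be pictured as small records stored under a pair of class names. The Python sketch below is a hypothetical data model, not the patent's own representation: `Node`, `DivisionRuleElement`, `ConversionRuleElement`, the class names "ClassA" and "ClassB", and the sample conditions are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    type: str
    content: Optional[str] = None
    # parent is excluded from repr/compare to avoid recursion through cycles.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


# An input node condition is modelled as a predicate over a node.
Condition = Callable[[Node], bool]


@dataclass
class DivisionRuleElement:
    condition: Condition   # input node condition
    pattern: re.Pattern    # division pattern (delimiter strings)
    delete_head: bool      # head deletion flag: True for "T", False for "F"


@dataclass
class ConversionRuleElement:
    condition: Condition   # input node condition
    output_type: str       # type of the node in the output structure


ClassPair = Tuple[str, str]  # (input class name, output class name)

# Each rule is an ordered list of elements, stored per pair of class names.
division_rules: Dict[ClassPair, List[DivisionRuleElement]] = {
    ("ClassA", "ClassB"): [
        DivisionRuleElement(lambda n: n.type == "paragraph", re.compile(r"<[0-9]+>"), True),
    ],
}
conversion_rules: Dict[ClassPair, List[ConversionRuleElement]] = {
    ("ClassA", "ClassB"): [
        ConversionRuleElement(lambda n: n.type == "author", "Author"),
    ],
}
```

Modelling the condition as a predicate keeps room for the conditions on parent and sibling nodes that the text mentions, since the predicate can inspect `node.parent` and `node.parent.children`.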
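When the division processing unit 24 receives an input document structure, it splits the content of that structure according to the supplied content division rule and creates a subdivided document structure.
[0055]
If a content division rule has been supplied, the division rule application element determination unit 24a searches for an applicable rule element for every constituent node of the input document structure. When it finds a rule element that applies to a node, it passes the details to the content division unit 24b.
[0056]
When the content division unit 24b receives these details from the division rule application element determination unit 24a, it splits the content of the node accordingly and generates fragment nodes.
The generated fragment nodes are linked to the input document structure within the division processing unit 24 to create the subdivided document structure, which is passed to the structure conversion unit 25. If the content division rule management unit 23a supplies no content division rule, the division processing unit 24 does nothing to the input document structure and passes it to the structure conversion unit 25 as it is. The processing procedure in the division processing unit 24 is described step by step later.
[0057]
The structure conversion unit 25 is supplied with the structure conversion rule by the structure conversion rule management unit 23b and receives the subdivided document structure or the input document structure from the division processing unit 24. It converts the received document structure according to the supplied structure conversion rule and creates a document structure conforming to the output document class (the output document structure). The output document structure, still in tree form, is passed to the document generator 26. The processing procedure in the structure conversion unit 25 is described step by step later.
[0058]
The document generator 26 converts the output document structure it receives into document data, provides it to the user through the input/output UI 10, and ends the document structure conversion process.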
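[0059]
A concrete example follows. Assume that the document class management unit 22 manages the document class "技術報告書" (Technical Report) and the document class "Report".
FIG. 4 shows the node type definitions that form part of the definition information of the document class "技術報告書" managed by the document class management unit 22.
[0060]
According to the type definition 30a of "技術報告書", this document class has a "技術報告書" (technical report) node with no content, a "表題" (title) node with character string content, a "著者" (author) node with character string content, a "節" (section) node with no content, a "見出し" (heading) node with character string content, and a "段落" (paragraph) node with character string content.
[0061]
FIG. 5 shows the structure constraints, which specify how the defined nodes connect, from the definition information of the document class "技術報告書".
According to the structure constraint 30b of "技術報告書", this document class has a "技術報告書" node as its root node, and beneath it first a "表題" node, then a "著者" node, and finally a "節" node. There may be more than one "節" node. Each "節" node has beneath it first a "見出し" node and then a "段落" node, and there may be more than one "段落" node under a single "節" node.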
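[0062]
Similarly, FIG. 6 shows the type definitions of the document class "Report".
According to the type definition 33a of "Report", this document class has a "Report" node with no content, a "Title" node with character string content, an "Author" node with character string content, a "Words" node with no content, a "Section" node with no content, a "Paragraph" node with character string content, a "Lead" node with character string content, and an "Item" node with character string content.
[0063]
FIG. 7 shows the structure constraints of the document class "Report".
According to the structure constraint 33b of "Report", this document class has a "Report" node as its root node, and beneath it first a "Title" node, then an "Author" node, then a "Words" node, and finally a "Section" node. There may be more than one "Section" node. The "Words" node has beneath it first a "Title" node and then a "Paragraph" node, and there may be more than one "Paragraph" node. A "Paragraph" node may in turn have beneath it first a "Lead" node and then an "Item" node, and there may be more than one "Item" node. The structure constraint also shows that the substructure of a "Section" node is exactly the same as that of the "Words" node.
[0064]
With the document class definition information described above managed in the document class management unit 22, assume that the conversion rule management unit 23 manages a content division rule and a structure conversion rule whose input document class name is "技術報告書" (Technical Report) and whose output document class name is "Report".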
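[0065]
FIG. 8(A) shows the content division rule whose input document class name is "技術報告書" (Technical Report) and whose output document class name is "Report".
The content division rule 31 has two content division rule elements. The input node condition of the first content division rule element 31a is "a node of type "段落" (paragraph) whose parent is a node of type "節" (section) that has as a child a node of type "見出し" (heading) whose content contains the character string "用語" (terminology)". Its division pattern is "<.*>" and its head deletion flag is "F".
[0066]
FIG. 8(B) shows a node that satisfies the input node condition of the first content division rule element 31a. As the condition requires, the "段落" node 31aa has as its parent a "節" node that has as a child a "見出し" node whose content contains the character string "用語". An input node condition can thus include conditions on the content of parent and sibling nodes.
[0067]
In the division pattern "<.*>", "." stands for any single character and "*" for zero or more repetitions of the preceding element. The division pattern "<.*>" therefore matches any character string enclosed between "<" and ">". Because the head deletion flag of the first rule element is "F", the matched character string does not need to be deleted when the node content is divided at the position of a string that matches "<.*>".
[0068]
The input node condition of the second content division rule element 31b of the content division rule 31 is "the type is "段落"". Its division pattern is "<[0-9]+>" and its head deletion flag is "T". In the division pattern "<[0-9]+>", "[0-9]" stands for any one of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, and "+" for one or more repetitions of the preceding element. The division pattern "<[0-9]+>" therefore matches any string of digits enclosed between "<" and ">". Because the head deletion flag of the second rule element is "T", the matched character string is deleted when the node content is divided at the position of a string that matches "<[0-9]+>".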
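The two division patterns can be tried with an ordinary regular expression engine. The sketch below uses Python's `re` module, which is an assumption: the patent does not say which pattern dialect it uses. In Python, "*" is greedy, so "<.*>" spans from the first "<" to the last ">" on a line. To get one match per bracketed string, as the description intends, the sketch also shows the non-greedy form "<.*?>".

```python
import re

sample = "intro <a> one <b> two"

# Greedy: a single match running from the first "<" to the last ">".
greedy = re.findall(r"<.*>", sample)

# Non-greedy: one match per bracketed string.
per_term = re.findall(r"<.*?>", sample)

# Digits only: "<a>" is not matched because "a" is not in [0-9].
numbered = re.findall(r"<[0-9]+>", "x <1> y <22> z <a>")

print(greedy)    # ['<a> one <b>']
print(per_term)  # ['<a>', '<b>']
print(numbered)  # ['<1>', '<22>']
```

When porting rules such as the first rule element to a greedy engine, a pattern like "<.*?>" or "<[^>]*>" is therefore the safer choice.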
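[0069]
The order of the content division rule elements in the content division rule 31 is significant. When searching for a content division rule element that matches a node, the input node conditions are always checked in the order first, second, and so on, and the search stops as soon as a matching content division rule element is found.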
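The first-match search over the two elements of the content division rule 31 can be sketched as follows. The Python names (`Node`, `DivisionRuleElement`, `add_child`, `in_terminology_section`, `find_rule_element`, `RULE_31`) are hypothetical, and the conditions are a reading of FIG. 8(A) as described in the text rather than a reproduction of the figure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    type: str
    content: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def add_child(parent: Node, child: Node) -> Node:
    """Link child under parent and return the child."""
    child.parent = parent
    parent.children.append(child)
    return child


@dataclass
class DivisionRuleElement:
    condition: Callable[[Node], bool]
    pattern: str
    delete_head: bool


def in_terminology_section(node: Node) -> bool:
    """A 段落 node whose parent 節 has a 見出し child containing 用語."""
    parent = node.parent
    return (
        node.type == "段落"
        and parent is not None
        and parent.type == "節"
        and any(c.type == "見出し" and "用語" in (c.content or "") for c in parent.children)
    )


# Order matters: the specific element must come before the general one.
RULE_31 = [
    DivisionRuleElement(in_terminology_section, "<.*>", False),
    DivisionRuleElement(lambda n: n.type == "段落", "<[0-9]+>", True),
]


def find_rule_element(node: Node, elements: List[DivisionRuleElement]) -> Optional[DivisionRuleElement]:
    """Check conditions in list order and stop at the first match."""
    for element in elements:
        if element.condition(node):
            return element
    return None
```

If the two elements were swapped, every "段落" node would match the general element first and the terminology-specific pattern would never be used, which is why the description stresses the ordering.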
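[0070]
In this example, of the fragment nodes generated from the content of a node that is a division target, the first fragment node is made a node of type "リード" (lead), and the second and subsequent fragment nodes are made nodes of type "項目" (item).
[0071]
FIG. 9 shows the structure conversion rule whose input document class name is "技術報告書" (Technical Report) and whose output document class name is "Report".
According to the structure conversion rule 32, a node whose input type name is "技術報告書" is converted to a node whose output type name is "Report". Likewise, the input node "表題" (title) is converted to the output node "Title", the input node "著者" (author) to the output node "Author", and an input node "節" (section) that has as a child a "見出し" (heading) node whose content contains the character string "用語" (terminology) to the output node "Words". An input node "節" that was not converted to "Words" is converted to the output node "Section", the input node "見出し" to the output node "Title", the input node "段落" (paragraph) to the output node "Paragraph", the input node "リード" (lead) to the output node "Lead", and the input node "項目" (item) to the output node "Item".
[0072]
The following describes how document data of the class "技術報告書" is converted when the information above is managed in advance by the document structure conversion apparatus 20.
First, the procedure for creating the subdivided document structure mentioned in step S6 of FIG. 3 is described in detail. This processing is performed in the division processing unit 24.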
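[0073]
FIG. 10 is a flowchart of the procedure for creating the subdivided document structure. The description follows the steps in the figure.
This processing works on one node of the input document structure (an input node) at a time. The node being worked on is called the current node, and processing begins with the root node as the current node.
[S11] The division rule application element determination unit 24a searches the supplied content division rule for a rule element that applies to the current node. The rule element found is called the current rule element.
[S12] It is determined whether a current rule element exists. If it does, processing proceeds to step S13. If not, processing proceeds to step S14.
[S13] Because the current node and current rule element have been determined, the content division process is performed and creation of the subdivided document structure ends. The content division procedure is described in detail later.
[S14] It is determined whether the current node has child nodes. If it does, processing proceeds to step S15. If not, creation of the subdivided document structure ends.
[S15] The first child of the current node becomes the new current node.
[S16] A subdivided document structure is created for the new current node. That is, steps S11 through the end of this flowchart are performed for the new current node.
[S17] It is determined whether the current node has a younger sibling. If it does, processing proceeds to step S18. If not, creation of the subdivided document structure ends.
[S18] The next younger sibling of the current node becomes the new current node, and processing returns to step S16.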
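Steps S11 to S18 amount to a depth-first walk that stops descending as soon as a rule element applies to a node. The Python sketch below expresses this with recursion. The names `Node` and `subdivide` and the two callables `find_element` and `split_content` are hypothetical placeholders for the division rule application element determination unit 24a and the content division unit 24b.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Node:
    type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def subdivide(
    node: Node,
    find_element: Callable[[Node], Optional[Any]],
    split_content: Callable[[Node, Any], None],
) -> None:
    element = find_element(node)        # S11: search for an applicable rule element
    if element is not None:             # S12: does a current rule element exist?
        split_content(node, element)    # S13: split this node's content and stop here
        return
    # S14-S18: otherwise visit the children in order, first child to last.
    # The list is copied so that splitting may add children while iterating.
    for child in list(node.children):
        subdivide(child, find_element, split_content)
```

A node that is split is not descended into. That is consistent with the embodiment, in which a node with a substructure has no content of its own to split.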
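[0074]
The content division process mentioned in step S13 of FIG. 10 is described next. This processing is performed in the content division unit 24b of the division processing unit 24 when the division rule application element determination unit 24a passes it the details (the current node and the current rule element).
[0075]
FIGS. 11 and 12 are flowcharts of the content division procedure. The description follows the steps in the figures. The procedure shown here is based on the content division rule 31 of FIG. 8(A).
[0076]
The process uses three variables that indicate positions in the content: previous-start, previous-end, and current position.
[S21] previous-start and previous-end are set to undefined.
[S22] The current position is set to the beginning of the content of the current node.
[S23] Starting at the current position and moving toward the end, the content is searched for a character string that matches the division pattern of the current rule element.
[S24] It is determined whether a character string matching the division pattern was found. If so, processing proceeds to step S25. If not, processing proceeds to step S33.
[S25] It is determined whether previous-start is undefined. If it is undefined, processing proceeds to step S26. Otherwise, processing proceeds to step S28.
[S26] A node of type "リード" (lead) is created and made the first child of the current node.
[S27] The content from the beginning up to the start of the character string that matched the division pattern becomes the content of the "リード" node created in step S26.
[S28] A node of type "項目" (item) is created and made the last child of the current node.
[S29] It is determined whether the head deletion flag of the current rule element is "T". If it is "T", processing proceeds to step S30. If it is "F", processing proceeds to step S31.
[S30] The content from previous-end up to the start of the character string that matched the division pattern becomes the content of the "項目" node created in step S28.
[S31] The content from previous-start up to the start of the character string that matched the division pattern becomes the content of the "項目" node created in step S28.
[S32] previous-start is set to the start of the character string that matched the division pattern, and previous-end and the current position are set to the end of that string. Processing then returns to step S23.
[S33] It is determined whether previous-start is undefined. If it is undefined, the content of the current node contains no character string matching the division pattern and does not need to be divided, so the content division process ends. Otherwise, processing proceeds to step S34.
[S34] A node of type "項目" (item) is created and made the last child of the current node.
[S35] It is determined whether the head deletion flag of the current rule element is "T". If it is "T", processing proceeds to step S36. If it is "F", processing proceeds to step S37.
[S36] The content from previous-end to the end becomes the content of the "項目" node created in step S34, and the content division process ends.
[S37] The content from previous-start to the end becomes the content of the "項目" node created in step S34, and the content division process ends.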
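Steps S21 to S37 can be condensed into a short loop. The Python sketch below is a reading of those steps under several assumptions that are not stated in the text: it uses Python's `re` module for pattern matching, it assumes that step S27 continues to step S32, it clears the parent's own content after a split (since a node with a substructure has no content), and it rejects patterns that match the empty string. The names `Node` and `split_content` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def split_content(node: Node, pattern: str, delete_head: bool) -> None:
    text = node.content or ""
    regex = re.compile(pattern)
    previous_start: Optional[int] = None                 # S21
    previous_end: Optional[int] = None                   # S21
    position = 0                                         # S22
    while True:
        match = regex.search(text, position)             # S23
        if match is None:                                # S24
            break
        if match.end() == match.start():
            raise ValueError("division pattern must not match the empty string")
        if previous_start is None:                       # S25
            # S26, S27: text before the first delimiter becomes the lead.
            node.children.insert(0, Node("リード", text[:match.start()]))
        else:
            # S28-S31: text between two delimiters becomes an item; the
            # head deletion flag decides whether the delimiter is kept.
            begin = previous_end if delete_head else previous_start
            node.children.append(Node("項目", text[begin:match.start()]))
        previous_start, previous_end = match.start(), match.end()   # S32
        position = match.end()                           # S32
    if previous_start is None:                           # S33: nothing to split
        return
    begin = previous_end if delete_head else previous_start         # S35
    node.children.append(Node("項目", text[begin:]))      # S34, S36/S37
    node.content = None  # assumption: a node with children holds no content
```

For a node whose content is "Steps<1>first<2>second", the pattern "<[0-9]+>" with the flag set to "T" yields a lead "Steps" and the items "first" and "second". With the flag set to "F" the items keep their delimiters, giving "<1>first" and "<2>second".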
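[0077]
Next, the structure conversion procedure mentioned in step S7 of FIG. 3 is described in detail. This processing is performed in the structure conversion unit 25, which has been supplied with the structure conversion rule 32 of FIG. 9.
[0078]
FIG. 13 is a flowchart of the structure conversion procedure. The description follows the steps in the figure.
This processing treats one node of the input document structure as the current node and one node of the document structure being created by the conversion as the parent node. At the start, the root node of the input document structure is the current node and the parent node is undefined.
[S41] The conversion rule is searched for a structure conversion rule element that applies to the current node, that is, one whose input node condition matches.
[S42] A node of the type specified by the structure conversion rule element is created.
[S43] It is determined whether the current node is the root node. If it is, processing proceeds to step S45. If not, processing proceeds to step S44.
[S44] Because the current node is not the root node, a parent node already exists. The node created in step S42 is linked as the last child of the parent node.
[S45] The node created in step S42 becomes the new parent node.
[S46] It is determined whether the current node has child nodes. If it does, processing proceeds to step S47. If not, this structure conversion process ends.
[S47] The first child of the current node becomes the new current node.
[S48] The structure conversion process is performed for the new current node and the new parent node. That is, steps S41 through the end of this flowchart are performed for the new current node and the new parent node.
[S49] It is determined whether the current node has a younger sibling. If it does, processing proceeds to step S50. If not, this structure conversion process ends.
[S50] The younger sibling of the current node becomes the new current node, and processing returns to step S48.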
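Steps S41 to S50 describe a recursive copy of the tree in which each node receives the output type chosen by the first matching rule element. The Python sketch below follows that reading. The names `Node`, `is_type`, `has_terminology_heading`, `RULE_32`, and `convert` are hypothetical, the rule list is transcribed from the prose description of FIG. 9 rather than from the figure itself, and copying the node content to the output node is an assumption, since the steps mention only node types.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    type: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def is_type(name: str) -> Callable[[Node], bool]:
    return lambda n: n.type == name


def has_terminology_heading(n: Node) -> bool:
    return any(c.type == "見出し" and "用語" in (c.content or "") for c in n.children)


# (input node condition, output type), checked in order.
RULE_32: List[Tuple[Callable[[Node], bool], str]] = [
    (is_type("技術報告書"), "Report"),
    (is_type("表題"), "Title"),
    (is_type("著者"), "Author"),
    (lambda n: n.type == "節" and has_terminology_heading(n), "Words"),
    (is_type("節"), "Section"),
    (is_type("見出し"), "Title"),
    (is_type("段落"), "Paragraph"),
    (is_type("リード"), "Lead"),
    (is_type("項目"), "Item"),
]


def convert(node: Node, rule=RULE_32, parent: Optional[Node] = None) -> Node:
    # S41: find the first rule element whose condition matches.
    output_type = next((out for cond, out in rule if cond(node)), None)
    if output_type is None:
        raise LookupError(f"no structure conversion rule element for {node.type}")
    created = Node(output_type, node.content)   # S42 (content copy is an assumption)
    if parent is not None:                      # S43, S44: link under the parent
        parent.children.append(created)
    for child in node.children:                 # S45-S50: created is the new parent
        convert(child, rule, created)
    return created
```

Calling `convert(root)` on the root of a subdivided "技術報告書" tree returns the root of the corresponding "Report" tree. Both "表題" and "見出し" map to "Title", so the two input types are distinguished in the output only by their position in the tree.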
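[0079]
Document structures are converted by the procedure described above. The following example converts document data whose document structure conforms to the document class "技術報告書" (Technical Report) into document data whose document structure conforms to the document class "Report".
[0080]
FIG. 14 shows an example of a document structure conforming to the document class "技術報告書". The document parser 21 extracted this example document structure 40 from input document data (not shown) on the basis of the definition information of the document class "技術報告書".
[0081]
FIG. 15 shows how the content of the example document structure 40 is divided. The "段落2" (paragraph 2) node 41a, which matches the input node condition of the first rule element of the content division rule 31 shown in FIG. 8(A), becomes the "段落2" node 42a as a result of the content division process.
[0082]
In this process, the content 41b of the "段落2" node 41a is divided at the character strings <文書構造> (document structure), <文書クラス> (document class), and <構造変換規則> (structure conversion rule), which match the division pattern "<.*>" given in the content division rule 31.
[0083]
As a result, the "リード1" (lead 1) node 42ca, the "項目1" (item 1) node 42cb, the "項目2" (item 2) node 42cc, and the "項目3" (item 3) node 42cd are created. The contents 42ba, 42bb, 42bc, and 42bd are linked to these new nodes.
[0084]
FIG. 16 shows the output document structure obtained by converting the example document structure 40 into a document structure conforming to the document class "Report". The output document structure 43 conforms to the definition of the document class "Report", which shows that the document structure conversion has succeeded.
[0085]
In the description above, the nodes created by content division were of the types "リード" (lead) and "項目" (item). Nodes of other types can also be created and converted, provided that the definition information of the input and output document classes, the content division rule, and the structure conversion rule are kept consistent with one another.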
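[0086]
[Effects of the Invention]
As described above, the document structure conversion apparatus of the present invention can split particular nodes in a document structure conforming to a first document class into more detailed nodes to create a subdivided document structure, and can convert every node in that subdivided document structure into a node included in the definition of a second document class. In other words, it can convert a first document structure conforming to the first document class into a second document structure conforming to a second document class whose structure constraints are finer than those of the first document class.
[0087]
The character strings that determine whether and where a node is split can be derived from character strings that follow a predetermined rule and are written as a division pattern in the content division rule. The desired document structure can therefore be created by defining content division rules appropriate to the document classes.
[0088]
Furthermore, an instruction can be given to delete the character string used as the basis for splitting from the nodes produced by the split, so that the document structure can be converted without including unnecessary character strings in the result.
[0089]
The document structure conversion method of the present invention also makes it possible to convert document structures as the user wishes even between document classes that take different approaches to the granularity of the content of their constituent nodes.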
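[Brief Description of the Drawings]
[FIG. 1] A diagram showing the principle of the present invention.
[FIG. 2] A block diagram of the document structure conversion apparatus of the present invention.
[FIG. 3] A flowchart of the document structure conversion procedure.
[FIG. 4] The node type definitions of the document class "技術報告書" (Technical Report).
[FIG. 5] The structure constraints of the document class "技術報告書" (Technical Report).
[FIG. 6] The type definitions of the document class "Report".
[FIG. 7] The structure constraints of the document class "Report".
[FIG. 8] (A) shows the content division rule whose input document class name is "技術報告書" (Technical Report) and whose output document class name is "Report". (B) shows a node that satisfies the input node condition of the first content division rule element.
[FIG. 9] The structure conversion rule whose input document class name is "技術報告書" (Technical Report) and whose output document class name is "Report".
[FIG. 10] A flowchart of the procedure for creating the subdivided document structure.
[FIG. 11] A flowchart of the content division procedure (part 1).
[FIG. 12] A flowchart of the content division procedure (part 2).
[FIG. 13] A flowchart of the structure conversion procedure.
[FIG. 14] An example of a document structure conforming to the document class "技術報告書" (Technical Report).
[FIG. 15] How the content of the example document structure is divided.
[FIG. 16] The output document structure obtained by converting the example document structure.
[Explanation of Reference Signs]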
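1 First document structure
2 Second document structure
3 Content division rule management means
4 Division rule application element determination means
5 Content division means
6 Structure conversion rule management means
7 Structure conversion means
[0001]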
BACKGROUND OF THE INVENTION
The present invention relates to a document structure conversion apparatus and a document structure conversion method for converting the document structure of a structured document, and in particular to a document structure conversion apparatus that converts a document structure according to a first document class into a document structure according to a second document class.
[0002]
[Prior art]
A document that has a logical structure made up of structural elements such as chapters, sections, and subsections is called a structured document. It is known that unifying the structure of such documents makes them easier to share and convert. With the spread of international standards such as SGML (Standard Generalized Markup Language) and ODA (Office Document Architecture), structured documents are becoming the mainstream of electronic documents.
[0003]
This structured document is usually structured according to a classification called a document class that defines the structure and components of the document. In ODA, the common logical structure corresponds to the document class, and in SGML, the DTD (Document Type Definition) plays the role of the document class. In the following description, "document" always means a structured document.
[0004]
It is important that the structure of a document follows the constraints of the document class. For example, a rule for laying out a plurality of documents is often determined on the assumption that the document structure and its constituent elements satisfy the constraints of a specific document class. For this reason, when the structure of a certain document deviates from the document class restriction, it is not possible to output with a correct layout.
[0005]
In addition, many programs for processing many documents, such as creating an abstract list from a group of reports, utilize the fact that the target document is created according to a specific document class. When such a program is used, the presence of a document that does not follow a predetermined document class can be an obstacle to program execution.
[0006]
Furthermore, in a database that targets a large number of documents, a document class is often used as a schema. Therefore, the existence of data deviating from the schema greatly impairs the reliability of the database.
[0007]
That is, a program written for documents having a document structure according to document class A is difficult to apply to documents having a document structure according to another document class B, no matter how similar the two document structures are. In such a case, it is necessary to convert the document structure according to the document class B into a document structure according to the document class A.
[0008]
In order to solve the above problem, there are several techniques for converting an arbitrary document structure into another document structure. The following description will be made assuming that all document structures can be represented by a tree structure.
[0009]
First, the document structure can be converted by applying the concept of the "fallback class" described in Fred Cole and Heather Brown, "Editing Structured Documents - Problems and Solutions" (Electronic Publishing, vol. 5, No. 4, pp. 209-216). Rules are defined in advance that state which node type specified in the document class A each node type specified in the document class B is converted to. When the conversion is performed, each node included in the document structure according to the document class B is converted one by one into a node included in the document structure according to the document class A based on the predefined rules.
[0010]
Japanese Laid-Open Patent Publication No. 7-28817 discloses a method in which a user defines a one-to-one correspondence between a type constituting a document class B and a type constituting a document class A. The document structure is converted by using the defined correspondence.
[0011]
[Problems to be solved by the invention]
However, when the details of the contents to be given to the nodes constituting the document structure are different between the document class A and the document class B, the desired document structure cannot be obtained by these methods.
[0012]
For example, suppose that in the document class A the information about the author of the document is the content of a single node "author", whereas in the document class B the author's family name is placed in the content of a node "family name" and the author's given name in the content of a node "given name".
[0013]
In such a case, it is impossible to convert the document structure according to the document class A into a document structure according to the document class B, regardless of which method is used. In these methods, it is impossible to realize a function of dividing a node in a document structure according to the contents and converting the contents into contents of different nodes.
[0014]
There is a method of dividing a non-structured document according to its contents, and converting it into different nodes and its contents to create a document structure according to a desired document class.
[0015]
Japanese Patent Laid-Open No. 62-249270 discloses a document processing apparatus that holds words used as headings in a dictionary, determines the headings of a document input by a user using the dictionary, and displays the document in structured form.
[0016]
Japanese Patent Application Laid-Open No. 4-4467 discloses a document structure analysis apparatus that registers a correspondence relationship between a character string pattern and a document structure and analyzes the document structure by matching with input document data.
[0017]
However, none of the above techniques can be used to normalize the document structure. That is, it is impossible to convert the document structure of a document that has already been structured according to an arbitrary document class into a document structure according to another document class.
[0018]
The present invention has been made in view of the above points, and its object is to provide a document structure conversion apparatus capable of converting a document structure into another document structure and, in doing so, arbitrarily dividing the contents of the constituent nodes of the original document structure.
[0019]
[Means for Solving the Problems]
In the present invention, in order to solve the above-described problem, there is provided a document structure conversion apparatus for converting a document structure according to a first document class into a document structure according to a second document class, comprising: content division rule management means for managing a content division rule consisting of an application condition, which indicates the condition of a node to be divided in a document structure according to the first document class, and a division pattern, which is specified by a pattern of delimiter character strings indicating the division position of the content of the node to be divided; structure conversion rule management means for managing a structure conversion rule that defines which type of node in a document structure according to the second document class each node in a document structure according to the first document class, and each node generated by the division, is converted to; division rule application element determination means for determining, when a first document structure according to the first document class is input, a division target node that satisfies the application condition from the first document structure; content dividing means for dividing the content of the division target node based on positions that match the division pattern, connecting a plurality of fragment nodes individually having the divided contents to predetermined positions of the first document structure, and creating a subdivided document structure; and structure conversion means for converting the subdivided document structure created by the content dividing means according to the structure conversion rule to create a second document structure.
[0020]
According to this document structure conversion apparatus, when the first document structure according to the first document class is input, the division rule application element determination means determines a division target node that satisfies the application condition from the first document structure. Next, the content dividing means divides the content of the division target node based on positions that match the division pattern, which is specified by a pattern of delimiter character strings, and connects a plurality of fragment nodes individually having the divided contents to predetermined positions of the first document structure, so that a subdivided document structure is created. Then, the subdivided document structure is converted by the structure conversion means according to the structure conversion rule, and a second document structure is created. The second document structure created in this way conforms to the second document class, which has structural constraints more detailed than those of the first document class.
[0021]
There is also provided a document structure conversion method for converting a document structure according to a first document class into a document structure according to a second document class, in which: the division rule application element determination means determines, from a first document structure according to the first document class, a division target node that satisfies an application condition indicating the condition of a node to be divided in a document structure according to the first document class; the content dividing means divides the content of the division target node based on positions that match a division pattern, which is specified by a pattern of delimiter character strings indicating the division position of the content of the node to be divided, and generates a subdivided document structure by connecting a plurality of fragment nodes individually having the divided contents to predetermined positions of the first document structure; and the subdivided document structure is converted to create a second document structure in accordance with a structure conversion rule that defines which type of node in a document structure according to the second document class each node in a document structure according to the first document class, and each node generated by the division, is converted to.
[0022]
According to this document structure conversion method, the content of the division target node of the first document structure is divided based on positions that match the division pattern, which is specified by a pattern of delimiter character strings. Then, a subdivided document structure is generated in which a plurality of fragment nodes individually having the divided contents are connected to predetermined positions of the first document structure. Further, this subdivided document structure is converted according to the structure conversion rule, and a second document structure is created. In this way, a second document structure is created that conforms to the second document class, which has structural constraints more detailed than those of the first document class.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 shows the principle of the present invention.
[0024]
In the present invention, the first document structure 1 according to the first document class is converted into the second document structure 2 according to the second document class.
The content division rule managing means 3 manages a content division rule comprising an application condition, which indicates the condition of a node to be divided in the first document structure, and a division pattern, which indicates the division position of the content of the node to be divided. The structure conversion rule management means 6 manages a structure conversion rule that defines which type of node in the second document structure each node in the first document structure, and each node generated by the division, is converted to.
[0025]
When the first document structure is input, the division rule application element determination unit 4 determines a division target node that satisfies the condition indicated by the application condition from the first document structure according to the content division rule.
[0026]
The content dividing means 5 divides the content of the division target node according to the division rule. Then, a fragmented document structure is created by connecting a plurality of fragment nodes having the divided contents to the first document structure.
[0027]
The structure conversion means 7 converts the subdivided document structure into the second document structure according to the structure conversion rule.
FIG. 2 is a block diagram of the document structure conversion apparatus of the present invention.
[0028]
An input / output u / i 10 is connected to the document structure conversion apparatus 20 and plays a role of receiving input from a user and displaying output.
The document structure conversion apparatus 20 includes a document parser 21 that extracts the document structure of input document data, a document class management unit 22 that manages document class definition information, a conversion rule management unit 23 that manages the rules used for document conversion, a division processing unit 24 that divides the contents of the input document structure, a structure conversion unit 25 that converts the document structure, and a document generator 26 that outputs the conversion result as document data.
[0029]
As input data, the user supplies the document data whose structure is to be converted (input document data), the name of the document class that defines the document structure of the input document data (input document class name), and the name of the document class that defines the document structure to be output (output document class name). These are input to the document structure conversion apparatus 20 via the input / output u / i 10.
[0030]
When the document parser 21 receives the input data, it first inputs the input document class name to the document class management unit 22 and requests its definition information. When the definition information of the input document class is supplied from the document class management unit 22, the input document data is analyzed based on the definition information, and the document structure (input document structure) is obtained. Then, the input document class name and the output document class name are input to the conversion rule management unit 23, and the input document structure is input to the division processing unit 24.
[0031]
The document class management unit 22 manages all document class definition information that can be handled by the document structure conversion apparatus 20 of the present invention. When an input document class name is input from the document parser 21, the definition information of the input document class is supplied.
[0032]
The conversion rule management unit 23 manages conversion rules used for document structure conversion. Here, the conversion rule refers to a content division rule that is a rule used when dividing the contents of a node and a structure conversion rule that is a rule used when converting the type of a node. The conversion rule management unit 23 includes a content division rule management unit 23a that manages content division rules and a structure conversion rule management unit 23b that manages structure conversion rules.
[0033]
In the content division rule management unit 23a, the content division rule is managed as a set together with the input document class name and the output document class name. Similarly, the structure conversion rule management unit 23b manages structure conversion rules as a set with the input document class name and the output document class name.
[0034]
When the input document class name and the output document class name are input to the conversion rule management unit 23, the content division rule management unit 23a searches for a content division rule that is managed as a set with both document class names. If the search finds such a content division rule, it is supplied to the division rule application element determination unit 24a. The structure conversion rule management unit 23b likewise searches for a structure conversion rule that is managed as a set with both document class names, and supplies the corresponding structure conversion rule to the structure conversion unit 25.
[0035]
The division processing unit 24 includes a division rule application element determination unit 24a that determines an application element from the supplied content division rule, and a content division unit 24b.
An input document structure is input from the document parser 21 to the division processing unit 24. When the content division rule corresponding to the input data is supplied from the conversion rule management unit 23, the division processing unit 24 determines the applicable rule element and divides the content for every node in the input document structure in turn, connects the resulting nodes to the input document structure to create a fragmented document structure, and inputs it to the structure conversion unit 25. If no content division rule is supplied from the conversion rule management unit 23, the input document structure is input to the structure conversion unit 25 as it is.
[0036]
At this time, the division rule application element determination unit 24a determines which rule element in the content division rule supplied from the conversion rule management unit 23 is applied to each node in the input document structure. The content division unit 24b divides the character string content of the node based on the division pattern in the content division rule element, and creates a plurality of fragment nodes. The fragment nodes thus created are then connected to the input document structure to create a fragmented document structure.
[0037]
The fragmented document structure or the input document structure is input from the division processing unit 24 to the structure conversion unit 25, and the structure conversion rule is supplied from the conversion rule management unit 23. The structure conversion unit 25 converts the document structure it has received based on the structure conversion rule, and inputs the result to the document generator 26 as a document structure according to the output document class (output document structure).
[0038]
The document generator 26 receives the output document structure from the structure conversion unit 25. The input output document structure is converted into document data and provided to the user via the input / output u / i.
[0039]
Here, the procedure of document structure conversion by the document structure conversion apparatus of the present invention will be described. FIG. 3 is a flowchart showing a procedure for document structure conversion. Hereinafter, description will be made along the steps in the figure.
[S1] Accept input data via the input / output u / i10. The input data indicates an input document class name, an output document class name, and input document data.
[S2] The document parser 21 inputs the input document class name and the output document class name to the conversion rule management unit 23. Upon receiving the input of both document class names, the conversion rule management unit 23 extracts any conversion rules managed as a set with both document classes. That is, if the corresponding content division rule is managed by the content division rule management unit 23a, the content division rule is supplied to the division rule application element determination unit 24a. If the corresponding structure conversion rule is managed by the structure conversion rule management unit 23b, the structure conversion rule is supplied to the structure conversion unit 25.
[S3] The document parser 21 determines whether or not the conversion rule management unit 23 has extracted the structure conversion rule. If it has been extracted, the process proceeds to step S4. If it could not be extracted, an error message is output via the input / output u / i, and the process is terminated. In that case, the user needs to check that the input document class name and the output document class name are correct, and that the structure conversion rule for document structure conversion between the two document classes is correctly managed.
[S4] The document parser 21 inputs the input document class name to the document class management unit 22 to obtain definition information of the input document class. Then, the input document data is analyzed based on the obtained definition information and input to the division processing unit 24 as an input document structure.
[S5] The division processing unit 24 determines whether or not a content division rule is supplied from the conversion rule management unit 23. If the content division rule is supplied, the process proceeds to step S6. If not supplied, the input document structure is input to the structure conversion unit 25, and the process proceeds to step S7.
[S6] The division processing unit 24 divides the contents of the input document structure to create a fragmented document structure, and inputs it to the structure conversion unit 25. The procedure for creating the segmented document structure will be described in detail later.
[S7] The structure conversion unit 25 converts the input fragmented document structure or the input document structure based on the supplied structure conversion rule, and creates a document structure (output document structure) according to the output document class. Then, the created output document structure is input to the document generator 26. The procedure of the structure conversion process will be described in detail later.
[S8] The document generator 26 uses the input output document structure as document data, and provides it to the user via the input / output u / i.
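The control flow of steps S1 to S8 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `run_conversion`, the exception `MissingStructureRuleError`, the dictionary-based rule stores, and the four callables standing in for the document parser 21, the division processing unit 24, the structure conversion unit 25, and the document generator 26 are all hypothetical and do not appear in the specification.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed key: (input document class name, output document class name).
RuleKey = Tuple[str, str]


class MissingStructureRuleError(LookupError):
    """Raised at step S3 when no structure conversion rule is managed."""


def run_conversion(
    input_class: str,
    output_class: str,
    document_data: str,
    *,
    division_rules: Dict[RuleKey, Any],
    structure_rules: Dict[RuleKey, Any],
    parse: Callable[[str, str], Any],
    divide: Callable[[Any, Any], Any],
    convert: Callable[[Any, Any], Any],
    generate: Callable[[Any], str],
) -> str:
    key = (input_class, output_class)
    # S2: look up both kinds of rule by the pair of document class names.
    structure_rule = structure_rules.get(key)
    division_rule = division_rules.get(key)
    # S3: a structure conversion rule is mandatory; otherwise report an error.
    if structure_rule is None:
        raise MissingStructureRuleError(f"no structure conversion rule for {key}")
    # S4: analyze the input document data into an input document structure.
    structure = parse(input_class, document_data)
    # S5/S6: divide content only when a content division rule was supplied.
    if division_rule is not None:
        structure = divide(structure, division_rule)
    # S7: convert the node types; S8: output the result as document data.
    return generate(convert(structure, structure_rule))
```

The content division rule is optional in this sketch while the structure conversion rule is required, mirroring the branches at steps S3 and S5.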
[0040]
Here, each part constituting the document structure conversion apparatus of the present invention will be described in more detail.
The document parser 21 receives input data from the user via the input / output u / i 10, inputs the input document class name and the output document class name to the conversion rule management unit 23, and determines whether or not a structure conversion rule matching the input data is managed by the conversion rule management unit 23. If such a structure conversion rule is managed, the document parser 21 extracts the input document structure from the input document data and inputs it to the division processing unit 24. If no such structure conversion rule is managed, an error message is output.
[0041]
The document class management unit 22 manages document classes. Here, one document class is stored as a pair of the document class name and the definition information of the document class. A plurality of document classes are stored in the document class management unit 22, but the name of each document class is unique within the document structure conversion apparatus 20.
[0042]
The definition information of the document class is composed of a type definition of nodes constituting the document structure and a structure constraint that defines a connection relation of the defined nodes. A document structure that satisfies the definition information of a specific document class is called a document structure of that document class. Further, document data having a document structure of the document class is referred to as document data of the document class.
[0043]
The type definition of the elements (nodes) constituting the document structure includes the following two elements. That is, a type name that is a character string for identifying the type of the node, and a content type designation that indicates the type of content that the node has. Here, the content type designation is one of three types: “has no content”, “has a character string type content”, or “has a geometric figure type content”. In this embodiment, it is assumed that all nodes having a lower structure have no content.
[0044]
A type whose name is “A” is called the “A type”, and a node of the A type is called an “A node”.
The structure constraint that defines the connection relation of the defined nodes is expressed as a tree structure built from the structure constrainers shown below and the node type definitions. There are three structure constrainers, SEQ, REP, and OPT, with the following meanings.
[0045]
The structure constrainer SEQ takes a plurality of substructures, and indicates that the structures defined by the substructures appear in the defined order.
The structure constrainer REP takes a single substructure, and indicates that the structure defined by the substructure appears repeatedly one or more times.
[0046]
The structure constrainer OPT takes a single substructure, and indicates that the structure defined by the substructure appears either once or not at all.
The content division rule management unit 23a included in the conversion rule management unit 23 manages content division rules. Here, a content division rule is a list of content division rule elements stored as a set together with an input document class name and an output document class name.
[0047]
The content division rule element includes the following three elements: an input node condition, a division pattern, and a head deletion flag.
In the input node condition, a condition for the input node to which the rule element is applied is described. This condition includes a condition regarding the type of the input node, the parent node, the sibling node, and the like of the input node.
[0048]
The division pattern defines a character string pattern used as a delimiter when dividing the character string type content of an input node that matches the input node condition. Dividing the content creates a plurality of fragment nodes.
[0049]
The head deletion flag indicates whether or not to delete a portion that matches the division pattern from the created fragment node. When deleting, “T” is described, and when not deleting, “F” is described.
[0050]
Here, the function of the head deletion flag will be described. Character strings that match a division pattern include symbols that merely mark the start or order of the text content, such as “(1)” and “●”, as well as strings that serve as headings of the text, such as “[Logical structure]” in “[Logical structure] What is a logical structure ...” or “Document Class:” in “Document Class: Document class ...”. Since many structured document layout processors have a numbering function, it is often better to delete symbols such as “(1)” and “●”.
[0051]
By providing a head deletion flag in the content division rule element, it becomes possible to control according to the state of the character string that matches those division patterns. That is, it is possible to make fine settings when dividing the contents of the node.
[0052]
The structure conversion rule management unit 23b included in the conversion rule management unit 23 manages the structure conversion rules. Here, a structure conversion rule is a list of structure conversion rule elements stored as a set together with an input document class name and an output document class name.
[0053]
The structure conversion rule element is composed of the following two elements: an input node condition and an output node.
In the input node condition, a condition for the input node to which the rule element is applied is described. This condition includes a condition regarding the type of the input node, the parent node, the sibling node, and the like of the input node.
[0054]
The output node describes the type of node in the output document structure into which an input node matching the input node condition is converted.
When the input document structure is input, the division processing unit 24 divides the content according to the supplied content division rule to create a fragmented document structure.
[0055]
At this time, if the content division rule is supplied, the division rule application element determination unit 24a searches for applicable rule elements for all the constituent nodes of the input document structure. When a rule element applicable to a certain node can be searched, the detailed information is input to the content dividing unit 24b.
[0056]
When the detailed information is input from the division rule application element determination unit 24a, the content dividing unit 24b divides the content of the node based on the information, and generates a fragment node.
The generated fragment node is connected to the input document structure in the division processing unit 24, and a fragmented document structure is created. The segmented document structure created here is input to the structure conversion unit 25. If the content division rule is not supplied from the content division rule management unit 23a, the division processing unit 24 does not perform any processing on the input document structure and inputs it to the structure conversion unit 25 as it is. The processing procedure in the division processing unit 24 will be described later in order.
[0057]
The structure conversion unit 25 receives the structure conversion rule from the structure conversion rule management unit 23b, and receives the fragmented document structure or the input document structure from the division processing unit 24. It then converts the received document structure based on the supplied structure conversion rule, and creates a document structure according to the output document class (output document structure). The output document structure, still in tree form, is input to the document generator 26. The processing procedure in the structure conversion unit 25 will be described later.
[0058]
The document generator 26 converts the input output document structure into document data. Then, it is provided to the user via the input / output u / i 10, and the document structure conversion process is terminated.
[0059]
Next, explanation will be given with a specific example. It is assumed that the document class management unit 22 manages the document class “technical report” and the document class “Report”.
FIG. 4 shows the type definition of the nodes constituting the document structure in the definition information of the document class “technical report” managed by the document class management unit 22.
[0060]
According to the type definition 30a of the “technical report”, this document class has a “technical report” node having no content, a “title” node having character string type content, an “author” node having character string type content, a “section” node having no content, a “heading” node having character string type content, and a “paragraph” node having character string type content.
[0061]
FIG. 5 shows structural constraints that define the connection relationship of the defined nodes in the definition information of the document class “technical report”.
According to the structure constraint 30b of the “technical report”, this document class has a “technical report” node as its root node, whose lower structure consists of a “title” node first, then an “author” node, and finally a “section” node. There may be a plurality of “section” nodes. Each “section” node has, as its lower structure, a “heading” node followed by a “paragraph” node. A plurality of “paragraph” nodes may exist under one “section” node.
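As an illustration of how the structure constrainers SEQ, REP, and OPT could be checked against a document tree, the following Python sketch encodes the structure constraint of the “technical report” class described above and tests whether a tree conforms to it. The class names `T`, `SEQ`, `REP`, `OPT`, and `Node`, and the functions `ends` and `conforms`, are hypothetical names chosen for this sketch; the specification does not prescribe any particular checking algorithm.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Union


@dataclass
class Node:
    type: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class T:  # one node of the given type, optionally with a constrained lower structure
    name: str
    sub: Optional["Constraint"] = None


@dataclass
class SEQ:  # substructures appear in the defined order
    parts: List["Constraint"]


@dataclass
class REP:  # substructure repeated one or more times
    part: "Constraint"


@dataclass
class OPT:  # substructure appears once or not at all
    part: "Constraint"


Constraint = Union[T, SEQ, REP, OPT]


def ends(c: Constraint, nodes: List[Node], i: int) -> Iterator[int]:
    """Yield every index j such that nodes[i:j] satisfies constraint c."""
    if isinstance(c, T):
        if i < len(nodes) and nodes[i].type == c.name and conforms(nodes[i], c):
            yield i + 1
    elif isinstance(c, SEQ):
        positions = {i}
        for part in c.parts:
            positions = {j for p in positions for j in ends(part, nodes, p)}
        yield from sorted(positions)
    elif isinstance(c, REP):
        frontier, seen = set(ends(c.part, nodes, i)), set()
        while frontier:
            seen |= frontier
            frontier = {j for p in frontier for j in ends(c.part, nodes, p)} - seen
        yield from sorted(seen)
    elif isinstance(c, OPT):
        yield i
        yield from ends(c.part, nodes, i)


def conforms(node: Node, t: T) -> bool:
    """True if the children of node satisfy the lower structure required by t."""
    if t.sub is None:
        return not node.children
    return len(node.children) in ends(t.sub, node.children, 0)


# Structure constraint of the "technical report" class as described in the text.
TECHNICAL_REPORT = T("technical report", SEQ([
    T("title"),
    T("author"),
    REP(T("section", SEQ([T("heading"), REP(T("paragraph"))]))),
]))
```

A tree whose root has a title, an author, and one or more sections (each with a heading followed by one or more paragraphs) passes `conforms(root, TECHNICAL_REPORT)`; a tree that omits the author does not.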
[0062]
Similarly, FIG. 6 shows the type definition of the document class “Report”.
According to the type definition 33a of “Report”, this document class has a “Report” node having no content, a “Title” node having character string type content, an “Author” node having character string type content, a “Words” node having no content, a “Section” node having no content, a “Paragraph” node having character string type content, a “Lead” node having character string type content, and an “Item” node having character string type content.
[0063]
FIG. 7 shows the structure constraints of the document class “Report”.
According to the structure constraint 33b of “Report”, this document class has a “Report” node as its root node, whose lower structure consists of a “Title” node, then an “Author” node, then a “Words” node, and finally a “Section” node. There may be a plurality of “Section” nodes. The “Words” node has, as its lower structure, a “Title” node followed by a “Paragraph” node, and a plurality of “Paragraph” nodes may exist. Furthermore, a “Paragraph” node may have, as its lower structure, a “Lead” node followed by an “Item” node, and a plurality of “Item” nodes may exist. According to this structure constraint, the lower structure of the “Section” node is exactly the same as that of the “Words” node.
[0064]
When the document class definition information described above is managed by the document class management unit 22, it is assumed that the conversion rule management unit 23 manages a content division rule and a structure conversion rule whose input document class name is “Technical Report” and whose output document class name is “Report”.
[0065]
FIG. 8A shows a content division rule in which the input document class name is “Technical Report” and the output document class name is “Report”.
The content division rule 31 has two content division rule elements. The input node condition of the first content division rule element 31a is “a ‘paragraph’ type node whose parent node is a ‘section’ type node that has, as a child node, a ‘heading’ type node whose content includes the character string ‘term’”; the division pattern is “<.*>”, and the head deletion flag is “F”.
[0066]
FIG. 8B shows a node that satisfies the input node condition of the first content division rule element 31a. As the input node condition requires, the “paragraph” node 31aa has, as its parent node, a “section” node that has as a child node a “heading” node whose content includes the character string “term”. Thus, the input node condition can include the contents of the parent node and sibling nodes as conditions.
[0067]
Here, the “.” in the division pattern “<.*>” represents any single character, and “*” indicates that the immediately preceding element is repeated zero or more times. Accordingly, the division pattern “<.*>” matches an arbitrary character string enclosed by “<” and “>”. Furthermore, since the head deletion flag of the first rule element is “F”, the matched character string is not deleted when the content of the node is divided at the positions of the character strings that match the division pattern “<.*>”.
[0068]
The input node condition of the second content division rule element 31b of the content division rule 31 is “the type is ‘paragraph’”, the division pattern is “<[0-9]+>”, and the head deletion flag is “T”. Here, the “[0-9]” in the division pattern “<[0-9]+>” represents any one of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, and “+” indicates that the immediately preceding element is repeated one or more times. Accordingly, the division pattern “<[0-9]+>” matches an arbitrary digit string enclosed by “<” and “>”. Furthermore, since the head deletion flag of the second rule element is “T”, the matched character string is deleted when the content of the node is divided at the positions of the character strings that match the division pattern “<[0-9]+>”.
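The two division patterns can be tried out with Python's standard `re` module, as in the short sketch below. The sample strings are invented for illustration. Note one caveat that is specific to this choice of regular expression engine and is not discussed in the specification: in Python, “*” is greedy, so “<.*>” spans from the first “<” to the last “>” on a line; the non-greedy form “<.*?>” (an assumption of this sketch, not part of the rule) matches each bracketed string separately.

```python
import re

# Division patterns of the two rule elements, written as Python regular expressions.
PATTERN_ANY = re.compile(r"<.*>")      # first rule element, head deletion flag "F"
PATTERN_NUM = re.compile(r"<[0-9]+>")  # second rule element, head deletion flag "T"

# Hypothetical paragraph content, invented for illustration.
text = "Items:<1>first<2>second"

# The numeric pattern finds each "<digits>" delimiter in turn.
num_matches = [m.group() for m in PATTERN_NUM.finditer(text)]

# Greedy matching: one match from the first "<" to the last ">".
greedy = PATTERN_ANY.search("<a>x<b>y").group()

# Non-greedy variant (assumption): one match per bracketed string.
lazy = re.findall(r"<.*?>", "<a>x<b>y")

print(num_matches)  # ['<1>', '<2>']
print(greedy)       # <a>x<b>
print(lazy)         # ['<a>', '<b>']
```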
[0069]
Here, the order of the content division rule elements in the content division rule 31 is significant. That is, when searching for a content division rule element that matches a certain node, the input node conditions of the content division rule elements are always checked in order, starting with the first, then the second, and so on, and the search ends as soon as a matching rule element is found.
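The first-match behaviour described above can be sketched as an ordered list of rule elements that is scanned from the front. In this sketch, `Node`, `DivisionRuleElement`, `RULE_31`, `is_term_paragraph`, and `find_rule_element` are hypothetical names, and representing an input node condition as a Python predicate is an assumption made for illustration; the specification does not state how conditions are encoded.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    type: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


@dataclass
class DivisionRuleElement:
    condition: Callable[[Node], bool]  # input node condition
    pattern: str                       # division pattern
    delete_head: bool                  # head deletion flag ("T" -> True, "F" -> False)


def is_term_paragraph(node: Node) -> bool:
    """Condition of element 31a: a paragraph whose parent section has a
    heading child whose content includes the string 'term'."""
    p = node.parent
    return (
        node.type == "paragraph"
        and p is not None
        and p.type == "section"
        and any(c.type == "heading" and "term" in c.content for c in p.children)
    )


# Content division rule 31 as an ordered list; order matters.
RULE_31 = [
    DivisionRuleElement(is_term_paragraph, r"<.*>", False),
    DivisionRuleElement(lambda n: n.type == "paragraph", r"<[0-9]+>", True),
]


def find_rule_element(node: Node, rule: List[DivisionRuleElement]):
    """Return the first rule element whose condition matches, else None."""
    for element in rule:
        if element.condition(node):
            return element
    return None
```

Because the more specific element 31a comes first, a paragraph under a “term” section never reaches the general element 31b, even though it would also satisfy that element's condition.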
[0070]
In this example, among the fragment nodes generated from the content of a node subject to content division, the first fragment node is made a node of the “lead” type, and the second and subsequent fragment nodes are made nodes of the “item” type.
[0071]
FIG. 9 shows a structure conversion rule in which the input document class name is “Technical Report” and the output document class name is “Report”.
According to the structure conversion rule 32, a node whose input type name is “Technical Report” is converted to a node whose output type name is “Report”. Similarly, the input node “title” is converted into the output node “Title”, and the input node “author” into the output node “Author”. An input node “section” that has, as a child node, a “heading” node whose content includes the character string “term” is converted into the output node “Words”. An input node “section” that is not converted into the output node “Words” is converted into the output node “Section”, the input node “heading” into the output node “Title”, the input node “paragraph” into the output node “Paragraph”, the input node “Lead” into the output node “Lead”, and the input node “Item” into the output node “Item”.
[0072]
A description will be given of how the document data of the “technical report” is converted when such information is previously managed by the document structure conversion apparatus 20.
First, the procedure for creating the fragmented document structure described in step S6 of FIG. 3 will be described in detail. This process is performed in the division processing unit 24.
[0073]
FIG. 10 is a flowchart showing a procedure for creating a segmented document structure. Hereinafter, description will be made along the steps in the figure.
In this process, the work is advanced focusing on one of the nodes (input nodes) in the input document structure. Here, the node of interest is called a current node, and when starting processing, the root node is a current node.
[S11] The division rule application element determination unit 24a searches the supplied content division rule for a rule element applicable to the current node. The rule element found is called the current rule element.
[S12] It is determined whether or not a current rule element exists. If it exists, the process proceeds to step S13; if it does not exist, the process proceeds to step S14.
[S13] Since the current node and the current rule element have been determined, the content division process is performed, and the creation of the fragmented document structure is terminated. The procedure of the content division process will be described in detail later.
[S14] It is determined whether or not the current node has a child node. If there is a child node, the process proceeds to step S15. If there are no child nodes, the creation of the fragmented document structure is terminated.
[S15] The first child node of the current node is set as the new current node.
[S16] A fragmented document structure is created for the new current node. That is, the process from step S11 to the end of this flowchart is performed for the new current node.
[S17] It is determined whether or not the current node has a next sibling. If there is a next sibling, the process proceeds to step S18. If there is not, the creation of the fragmented document structure is terminated.
[S18] The next sibling of the current node is set as the new current node, and the process returns to step S16.
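Steps S11 to S18 amount to a depth-first traversal that stops descending as soon as a rule element applies to a node. A minimal Python sketch is given below, with the flowchart's explicit first-child and next-sibling bookkeeping replaced by a loop over the children. The names `Node` and `create_fragmented_structure`, and the two callables passed in for rule lookup and content division, are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Node:
    type: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def create_fragmented_structure(
    node: Node,
    find_rule_element: Callable[[Node], Optional[Any]],
    divide: Callable[[Node, Any], None],
) -> None:
    # S11: search for a rule element applicable to the current node.
    element = find_rule_element(node)
    # S12/S13: if one exists, divide this node's content and stop here.
    if element is not None:
        divide(node, element)
        return
    # S14 to S18: otherwise visit each child in order, first child first,
    # then each next sibling, applying the same procedure recursively (S16).
    for child in list(node.children):
        create_fragmented_structure(child, find_rule_element, divide)
```

Iterating over a copy of `node.children` is a defensive choice in this sketch, since `divide` is expected to attach new fragment nodes to the tree.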
[0074]
Here, the content division processing described in step S13 of FIG. 10 will be described. This processing is performed when detailed information (current node and current rule element) is input from the division rule application element determination unit 24a in the content division unit 24b included in the division processing unit 24.
[0075]
FIGS. 11 and 12 are flowcharts showing the procedure of the content division process. Hereinafter, description will be made according to the steps in the figures. This process is performed by the division processing unit 24, and the procedure shown here is based on the content division rule 31 shown in FIG.
[0076]
Also, in this process, the content division process is advanced using the following three variables indicating the processing position in the content: previous-start, previous-end, and current position.
[S21] Previous-start and previous-end are set to undetermined.
[S22] The beginning of the content of the current node is set as the current position.
[S23] The content after the current position is searched for a character string that matches the division pattern of the current rule element.
[S24] It is determined whether a character string matching the division pattern has been found. If it has been found, the process proceeds to step S25. If not, the process proceeds to step S33.
[S25] It is determined whether or not previous-start is undetermined. If it is undetermined, the process proceeds to step S26; if it has already been determined, the process proceeds to step S28.
[S26] A node of type “lead” is created and made the first child of the current node.
[S27] The content from the beginning of the content to the start position of the character string that matches the division pattern is made the content of the “lead” node created in step S26.
[S28] A node of type “item” is created and made the last child of the current node.
[S29] It is determined whether the head deletion flag of the current rule element is “T”. If “T”, the process proceeds to step S30, and if “F”, the process proceeds to step S31.
[S30] The contents from the previous-end to the start position of the character string matching the division pattern are the contents of the “item” node created in step S28.
[S31] The contents from the previous-start to the start position of the character string matching the division pattern are the contents of the “item” node created in step S28.
[S32] The start position of the character string that matches the division pattern is set to previous-start. Also, the end position of the character string that matches the division pattern is set to previous-end and current position. Thereafter, the process proceeds to step S23 again.
[S33] It is determined whether or not previous-start is undetermined. If it is undetermined, the content of the current node contains no character string that matches the division pattern and does not need to be divided, so the content division process ends. If it has already been determined, the process proceeds to step S34.
[S34] Create a node of type "item" and make it the last child of the current node.
[S35] It is determined whether the head deletion flag of the current rule element is “T”. If “T”, the process proceeds to step S36, and if “F”, the process proceeds to step S37.
[S36] The content from the previous-end to the end of the content is set to the content of the “item” node created in step S34, and the content division processing is terminated.
[S37] The content from the previous-start to the end of the content is set to the content of the “item” node created in step S34, and the content division processing is terminated.
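The content division procedure of steps S21 to S37 can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the specification: the names `Node` and `divide_content`; the use of Python's `re` module for the division pattern; the reading that, after the “lead” node is filled in step S27, processing continues at step S32; the clearing of the original node's content once it has been divided; and the rejection of patterns that match an empty string (added so the loop always advances).

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    type: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def divide_content(node: Node, pattern: str, delete_head: bool) -> bool:
    """Split node.content at each match of pattern into 'lead'/'item' children.

    Returns True if the content was divided, False if no match was found.
    """
    regex = re.compile(pattern)
    content = node.content
    prev_start: Optional[int] = None  # S21: previous-start undetermined
    prev_end: Optional[int] = None    # S21: previous-end undetermined
    pos = 0                           # S22: current position = start of content
    while True:
        m = regex.search(content, pos)            # S23
        if m is None:                             # S24: no further match
            break
        if m.end() == m.start():
            raise ValueError("division pattern must not match an empty string")
        if prev_start is None:                    # S25: first match
            # S26/S27: text before the first delimiter becomes the "lead" node.
            node.children.insert(0, Node("lead", content[:m.start()]))
        else:
            # S28 to S31: text between delimiters becomes an "item" node; the
            # head deletion flag decides whether the delimiter itself is kept.
            begin = prev_end if delete_head else prev_start
            node.children.append(Node("item", content[begin:m.start()]))
        prev_start, prev_end = m.start(), m.end() # S32
        pos = m.end()                             # S32, then back to S23
    if prev_start is None:                        # S33: nothing to divide
        return False
    # S34 to S37: the remainder after the last delimiter becomes the last item.
    begin = prev_end if delete_head else prev_start
    node.children.append(Node("item", content[begin:]))
    node.content = ""  # assumption: divided content now lives in the children
    return True
```

For example, dividing the hypothetical content “Intro<1>alpha<2>beta” with the pattern “<[0-9]+>” yields a “lead” node “Intro” followed by “item” nodes “alpha” and “beta” when the head deletion flag is “T”, and “item” nodes “<1>alpha” and “<2>beta” when it is “F”.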
[0077]
Next, the procedure of the structure conversion process described in step S7 in FIG. 3 will be described in detail. This process is performed in the structure conversion unit 25 that receives the supply of the structure conversion rule 32 shown in FIG.
[0078]
FIG. 13 is a flowchart showing the procedure of the structure conversion process. Hereinafter, description will be made along the steps in the figure.
In this process, one of the input document structure nodes is used as a current node, and one of the document structure nodes created by the conversion process is used as a parent node. In the first stage of processing, the root node of the input document structure is set as a current node. At this time, the parent node is undecided.
[S41] The structure conversion rule element applicable to the current node (matching the input node condition) is searched from the conversion rules.
[S42] A node of the type specified by the structure conversion rule element is created.
[S43] It is determined whether or not the current node is the root node. If it is the root node, the process proceeds to step S45, and if it is not the root node, the process proceeds to step S44.
[S44] The fact that the current node is not the root node means that a parent node already exists. The node created in step S42 is linked as the last child of the parent node.
[S45] The node created in step S42 is set as a new parent node.
[S46] It is determined whether or not the current node has a child node. If it has a child node, the process proceeds to step S47. If there is no child node, the structure conversion process is terminated.
[S47] The first child (eldest child) of the current node is set as the new current node.
[S48] A structure conversion process is performed on the new current node and the new parent node. That is, the process from step S41 to the end of this flowchart is performed for the new current node and the new parent node.
[S49] It is determined whether the current node has a next sibling (younger brother). If it does, the process proceeds to step S50. If it does not, the structure conversion process is terminated.
[S50] The next sibling of the current node is set as the new current node, and the process returns to step S48.
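The recursion in steps S41 to S50 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the structure conversion unit 25: the `Node` class, the representation of a rule element as a (condition, output type) pair, the carrying over of node content, and the type names in the sample rules are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    type: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical rule element: (input node condition, output node type).
RuleElement = Tuple[Callable[[Node], bool], str]


def convert(current: Node, rules: List[RuleElement],
            parent: Optional[Node] = None) -> Node:
    # S41: search for the first rule element whose condition matches.
    for condition, output_type in rules:
        if condition(current):
            break
    else:
        raise ValueError(f"no structure conversion rule for {current.type!r}")
    # S42: create a node of the specified type.
    # Carrying the content over is an assumption; the steps do not mention it.
    created = Node(output_type, current.content)
    # S43-S44: a non-root node is linked as the last child of the parent.
    if parent is not None:
        parent.children.append(created)
    # S45-S50: the created node becomes the parent; visit children in order.
    for child in current.children:
        convert(child, rules, created)
    return created


# Illustrative type names only; the real rule elements are those of FIG. 9.
rules: List[RuleElement] = [
    (lambda n: n.type == "technical report", "report"),
    (lambda n: n.type == "paragraph", "text"),
    (lambda n: n.type in ("lead", "item"), "entry"),
]
source = Node("technical report", children=[
    Node("paragraph", "intro"),
    Node("paragraph", children=[Node("lead", "a"), Node("item", "b")]),
])
result = convert(source, rules)
```

The loop over `current.children` stands in for the first-child and next-sibling traversal of steps S47 to S50. Appending each created node as the last child of its parent preserves the order of the input nodes.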
[0079]
The document structure is converted according to the procedure described above. An example is described below in which document data having a document structure according to the document class “Technical Report” is converted into document data having a document structure according to the document class “Report”.
[0080]
FIG. 14 shows an example of a document structure according to the document class “Technical Report”. This document structure example 40 is extracted from the input document data (not shown) by the document parser 21 based on the definition information of the document class “Technical Report”.
[0081]
FIG. 15 shows the state of content division in the document structure example 40. The “paragraph 2” node 41a matches the input node condition of the first rule element of the content division rule 31 shown in FIG.
[0082]
At this time, the content 41b of the “paragraph 2” node 41a is divided at the character strings <document structure>, <document class>, and <structure conversion rule>, which match the division pattern “<.*>” shown in the content division rule 31.
[0083]
Then, a “lead 1” node 42ca, an “item 1” node 42cb, an “item 2” node 42cc, and an “item 3” node 42cd are created. Content 42ba, content 42bb, content 42bc, and content 42bd are connected to these new nodes.
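The division illustrated here can be sketched in Python as follows. This is a hedged sketch under several assumptions: the function name `divide_content` and the sample string are invented for illustration (the actual content 41b is not reproduced in the text), the lead is created only when text precedes the first match, and the pattern is written as the non-greedy `<.*?>` because a greedy `<.*>` in Python's `re` module would match from the first `<` to the last `>`. The `delete_head` argument corresponds to the head deletion flag tested in step S35.

```python
import re
from typing import List, Tuple


def divide_content(content: str, pattern: str,
                   delete_head: bool) -> List[Tuple[str, str]]:
    """Return (node type, content) pairs: an optional lead, then one item per match."""
    matches = list(re.finditer(pattern, content))
    if not matches:
        # No match means no division is needed (compare step S33).
        return []
    pieces: List[Tuple[str, str]] = []
    lead = content[:matches[0].start()]
    if lead:
        pieces.append(("lead", lead))
    for i, match in enumerate(matches):
        # Flag "T": the item starts after the matched string (step S36).
        # Flag "F": the item starts at the matched string (step S37).
        start = match.end() if delete_head else match.start()
        # Each item runs to the next match, or to the end of the content.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        pieces.append(("item", content[start:end]))
    return pieces


# Invented sample text containing the three delimiters named above.
sample = ("Terms:<document structure>a tree"
          "<document class>a schema"
          "<structure conversion rule>a mapping")
kept = divide_content(sample, r"<.*?>", delete_head=False)
dropped = divide_content(sample, r"<.*?>", delete_head=True)
```

With this sample, both calls yield one lead and three items, matching the “lead 1” and “item 1” to “item 3” nodes of the example. The items in `kept` begin with the delimiter string, and those in `dropped` omit it.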
[0084]
FIG. 16 shows the output document structure obtained when the document structure example 40 is converted into a document structure according to the document class “Report”. The output document structure 43 conforms to the definition of the document class “Report”, which shows that the document structure conversion succeeded.
[0085]
In the above description, the nodes created at the time of content division are “lead” and “item”. However, as long as the definition information of the input and output document classes, the content division rule, and the structure conversion rule are kept consistent with one another, nodes of other types can also be created and converted.
[0086]
[Effects of the invention]
As described above, the document structure conversion apparatus of the present invention creates a subdivided document structure by dividing specific nodes in the document structure according to the first document class into finer nodes, so that every node in the subdivided document structure can be converted into a node included in the definition of the second document class. In other words, a first document structure according to the first document class can be converted into a second document structure according to a second document class whose structural constraints are finer than those of the first document class.
[0087]
In addition, the character string that serves as the reference for node division, and its position, are derived from a character string pattern that follows a predetermined rule and is described as the division pattern in the content division rule. A desired document structure can therefore be created by defining the division rule accordingly.
[0088]
Further, the rule can instruct that the character string used as the node division criterion be deleted from the nodes after division, so that the document structure can be converted without unnecessary character strings remaining in the converted document structure.
[0089]
Further, according to the document structure conversion method of the present invention, the document structure can be converted as the user intends even between document classes that differ in how finely the content of their constituent nodes is divided.
[Brief description of the drawings]
FIG. 1 is a principle diagram of the present invention.
FIG. 2 is a block diagram of a document structure conversion apparatus according to the present invention.
FIG. 3 is a flowchart illustrating a procedure for document structure conversion.
FIG. 4 shows a node type definition of a document class “Technical Report”.
FIG. 5 shows the structural constraints of the document class “Technical Report”.
FIG. 6 shows a type definition of a document class “Report”.
FIG. 7 shows structure constraints of a document class “Report”.
FIG. 8A shows a content division rule whose input document class name is “Technical Report” and whose output document class name is “Report”. FIG. 8B shows a node that satisfies the input node condition of the first content division rule element.
FIG. 9 shows a structure conversion rule whose input document class name is “Technical Report” and whose output document class name is “Report”.
FIG. 10 is a flowchart showing a procedure for creating a fragmented document structure.
FIG. 11 is a flowchart (part 1) illustrating a procedure of content division processing.
FIG. 12 is a flowchart (part 2) illustrating a procedure of content division processing.
FIG. 13 is a flowchart illustrating a procedure of structure conversion processing.
FIG. 14 shows an example of a document structure according to a document class “Technical Report”.
FIG. 15 shows a state of content division in a document structure example.
FIG. 16 shows an output document structure obtained by converting a document structure example.
[Explanation of symbols]
1 First document structure
2 Second document structure
3 Content division rule management means
4 Division rule application element determination means
5 Content division means
6 Structure conversion rule management means
7 Structure conversion means

Claims

1. A document structure conversion apparatus for converting a document structure according to a first document class into a document structure according to a second document class, comprising:
content division rule management means for managing a content division rule consisting of an application condition, which indicates a condition on a node to be divided in the document structure according to the first document class, and a division pattern, which is defined by a character string pattern serving as a delimiter that indicates the division position of the content of the node to be divided;
structure conversion rule management means for managing a structure conversion rule that defines, for the nodes in the document structure according to the first document class and the nodes generated by the division, the type of node in the document structure according to the second document class into which each is converted;
division rule application element determination means for determining, when a first document structure according to the first document class is input, a division target node in the first document structure that satisfies the condition indicated by the application condition;
content dividing means for dividing the content of the division target node with reference to the positions that match the division pattern, and for connecting a plurality of fragment nodes, each having a piece of the divided content, to predetermined positions of the first document structure, thereby creating a subdivided document structure; and
structure conversion means for converting, in accordance with the structure conversion rule, the subdivided document structure created by the content dividing means, thereby creating a second document structure.

2. The document structure conversion apparatus according to claim 1, wherein:
the content division rule management means expresses the division pattern as a character string written according to a predetermined rule; and
the content dividing means divides the content of the division target node with reference to the position of a character string that matches the division pattern.

3. The document structure conversion apparatus according to claim 2, wherein:
the content division rule management means includes, in the content division rule, deletion information indicating whether or not to delete a character string that matches the division pattern; and
the content dividing means, when the deletion information indicates deletion of the character string, does not include the character string that matches the division pattern in the contents of the plurality of fragment nodes.

4. A document structure conversion method for converting a document structure according to a first document class into a document structure according to a second document class, comprising:
determining, by division rule application element determination means, from a first document structure according to the first document class, a division target node that satisfies an application condition indicating a condition on a node to be divided in the document structure according to the first document class;
dividing, by content dividing means, the content of the division target node with reference to the positions that match a division pattern defined by a character string pattern serving as a delimiter that indicates the division position of the content;
connecting, by the content dividing means, a plurality of fragment nodes, each having a piece of the divided content, to predetermined positions of the first document structure, thereby generating a subdivided document structure; and
converting the subdivided document structure in accordance with a structure conversion rule that defines, for the nodes in the document structure according to the first document class and the nodes generated by the division, the type of node in the document structure according to the second document class into which each is converted, thereby creating a second document structure.

5. The document structure conversion method according to claim 4, wherein, when the content of the division target node is divided, a character string written according to a predetermined rule is used as the division pattern, and the content of the division target node is divided at the position of a character string that conforms to the rule indicated by the division pattern.

6. The document structure conversion method according to claim 5, wherein, when the content of the division target node is divided, deletion information indicating whether or not to delete the character string that matches the division pattern is referred to, and, when the deletion information indicates deletion, the character string that matches the division pattern is not included in the content after division.