JP2005190074A

JP2005190074A - Document dividing device and method, program and index preparing device

Info

Publication number: JP2005190074A
Application number: JP2003429229A
Authority: JP
Inventors: Suefumi Yamada; 季史山田; Shigehisa Kawabe; 惠久川邉
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2003-12-25
Filing date: 2003-12-25
Publication date: 2005-07-14

Abstract

PROBLEM TO BE SOLVED: To provide a document dividing device and method, a program, and an index preparing device that divide a structured document containing a plurality of unit documents with different contents, and thereby improve retrieval precision for such structured documents. SOLUTION: Before a structured document containing a plurality of unit documents is searched, it is divided into its unit documents, and an index is prepared from the resulting divided documents. The divided documents are produced by: a path expression table 46 that stores a start path expression indicating the start point of a unit document and an end path expression indicating its end point; a document structure specifying part 32 that stores the tags encountered while reading the structured document and specifies the document structure at each tag appearance position; a coincidence determination part 34 that determines whether the specified document structure matches a path expression stored in the path expression table; and a processing execution part 36 that outputs, as a separate new document, the text from the tag appearance position matching the start path expression to the tag appearance position matching the end path expression. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a document dividing apparatus, a method and a program therefor, and an index creating apparatus, for dividing a structured document that includes a plurality of unit documents with different contents.

A collection-type document search apparatus creates, in advance, an index that associates the words contained in each search target document with identification information for that document, and performs document searches based on this index. The document information recorded in the index includes a document ID that identifies the document and an address that indicates its location. Normally, one document ID and one address are defined for each piece of document data.

Patent Document 1: JP 2000-259660 A

Some documents, however, record a plurality of topics as unit documents within a single piece of document data. Examples include electronic newspaper articles provided on the Web and electronic manuals for programming languages. In such a document, words that appear in different unit documents are recorded in the index in association with the same document ID, address, and so on. When an AND search specifying a plurality of words as search terms is run against a document containing a plurality of unit documents, the result may be contrary to the user's intention.

For example, in an electronic newspaper, an article (unit document) about “Company A's stock price” and an article (unit document) about “Company B's new product” may be recorded in the same document data. In that case, words such as “Company A”, “stock price”, “Company B”, and “new product” are recorded in the index in association with the same document ID. If a user looking for an article about “Company A's new product” runs an AND search with “Company A” and “new product” as search terms, this document is returned as a search result, because “Company A” and “new product” are both associated with the same document ID (address). However, the document is about “Company A's stock price” and “Company B's new product”, and is not the document the user intended.

Thus, when a single piece of document data contains unit documents on a plurality of topics, search accuracy is reduced.
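The false hit described above can be reproduced with a small inverted index. The following Python sketch is illustrative only and is not part of the disclosure; the document ID `news-001`, the sample text, and the function names `build_index` and `and_search` are assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_search(index, *terms):
    """Return the IDs of the documents that contain every search term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical document data: one file that holds two unrelated articles.
whole = {
    "news-001": "CompanyA stock price rises CompanyB announces new-product",
}

# Both terms share the single document ID, so the AND search reports a hit
# even though no article is about a new product of CompanyA.
false_hit = and_search(build_index(whole), "CompanyA", "new-product")
```

Because the index has only one document ID for the whole file, the two terms cannot be told apart by article, which is the loss of accuracy the text describes.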

Furthermore, when the document data contains a large amount of text, it is hard to tell where in the document a search term appears. The user must therefore search the returned document again for the position of the search term, which is cumbersome.

Many such documents are written as structured documents such as HTML. A structured document is a document in which tags indicating the document structure are embedded in the text that forms the document body. Analyzing these tags makes it possible to analyze the structure of the document.

As an apparatus that handles such structured documents, Patent Document 1 discloses an attribute extraction apparatus that extracts the attribute name and attribute value (the body text of the partial structure) assigned to each of a plurality of partial structures (for example, chapters and paragraphs) that make up a structured document. With this apparatus, the necessary attribute names and attribute values can be extracted with a simple specification, without having to be aware of the differences among the diverse ways in which structured documents are expressed.

However, that apparatus does not consider improving search accuracy for structured documents that include a plurality of unit documents as described above.

Accordingly, an object of the present invention is to provide a document dividing apparatus, a method and a program therefor, and an index creating apparatus that can improve search accuracy for a structured document including a plurality of unit documents.

The document dividing apparatus of the present invention divides a structured document that includes a plurality of unit documents and consists of tags defining the document structure and text forming the body of the document. The apparatus is characterized by comprising: a path expression table that stores a start path expression, which represents as a sequence of tags the document structure indicating the start point of a unit document, and an end path expression, which represents as a sequence of tags the document structure indicating the end point of a unit document; document structure specifying means that reads the structured document, stores the tags that appear, and specifies the document structure at each tag appearance position; coincidence determination means that determines whether the specified document structure matches a path expression stored in the path expression table; and process execution means that outputs, as a separate new document, the portion of the document from the tag appearance position whose document structure matches the start path expression to the tag appearance position whose document structure matches the end path expression.

In a preferred aspect, the path expression table further stores an attribute path expression, which represents as a sequence of tags the document structure indicating an attribute of a unit document, and when outputting a new document the process execution means outputs the text following the tag appearance position whose document structure matches the attribute path expression as an attribute of the new document. In another preferred aspect, the path expression table further stores an appearance position path expression, which represents as a sequence of tags the document structure indicating the appearance position of a unit document, and when outputting a new document the process execution means outputs the text following the tag appearance position whose document structure matches the appearance position path expression as the appearance position of the new document.

Here, a structured document is a document whose structure is defined by tags, such as an HTML document or an XML document. A unit document is a document describing a single topic, for example the text of one news story in an electronic newspaper or the text covering one item in an electronic manual.

According to the present invention, a structured document can be divided into its unit documents. Therefore, if an index is created from the divided documents, document searches can be performed with higher accuracy.

An index creating apparatus according to an embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram showing the hardware configuration of an index creating apparatus 10 according to an embodiment of the present invention. The index creating apparatus 10 comprises the following components, connected to one another by a bus: a central processing unit (hereinafter “CPU”) 12 that controls each unit; a memory 14 composed of ROM, RAM, and the like; a hard disk 16 that stores the structured documents to be searched, their document information, and the like; an input unit 20, such as a keyboard or mouse, for entering search conditions and various instructions; a display unit 22, such as a CRT or liquid crystal display, that displays search results and the like; a flexible disk drive (FDD) 24 that reads and writes data on flexible disks; a CD-ROM drive 26 that reads data from CD-ROMs; and a communication unit 18 for exchanging signals and data with other communication devices.

FIG. 2 is a block diagram showing the functional configuration of the index creating apparatus 10. The index creating apparatus 10 has a document reading unit 30, a document structure specifying unit 32, a coincidence determination unit 34, a process execution unit 36, and an index creation unit 38, each controlled by the CPU 12. In addition, a document storage unit 40 that stores the structured documents to be searched, a temporary storage unit 42, a divided document storage unit 44, and the like are provided on the hard disk 16 or on the hard disk of another computer reached through the communication unit 18.

The document storage unit 40 stores a large number of structured documents to be searched. Here, a structured document is a document whose structure is expressed by tags, such as an HTML document or an XML document.

The stored structured documents are read as needed by the document reading unit 30, which outputs the text it reads to the document structure specifying unit 32. The document structure specifying unit 32 determines whether the text received from the document reading unit 30 contains a tag and, if it does, specifies the document structure represented by that tag. The specified document structure is output to the coincidence determination unit 34 as a path expression.

Here, a path expression represents a document structure as a path, that is, the tags that define the document structure at that point, listed in order. For example, in the HTML document shown in FIG. 3, the document structure at <BODY> on the second line is expressed by the path expression “HTML/BODY”. Likewise, the document structure at <B> on the fourth line is “HTML/BODY/B”. This indicates that the document structure at <B> is defined by the three tags <HTML>, <BODY>, and <B>.

The coincidence determination unit 34 determines whether the path expression received from the document structure specifying unit 32 matches a path expression recorded in the path expression table 46, described later. The determination result is output to the process execution unit 36, which executes various processes based on the result and the path expression table 46. The processes to be executed are recorded in the path expression table 46 and include temporarily storing character strings and outputting the temporarily stored character strings as a divided document. Character strings are temporarily stored in the temporary storage unit 42, and divided documents are output to the divided document storage unit 44.

The index creation unit 38 refers to the divided document storage unit 44 and creates an index that associates the words contained in each divided document with the identification information of that document. Conventional index creation techniques can be used for this processing.

Next, the document division processing, which is the key part of the index creating apparatus 10, is described. FIG. 4 shows an example of a structured document (HTML document) to be divided, and FIG. 5 shows how that document is displayed. Line numbers are shown on the left side of FIG. 4 for explanation only; an actual structured document has no such line numbers.

Here, an electronic manual for a programming language is taken as an example of a structured document to be divided. The electronic manual is written as HTML documents such as the one shown in FIG. 4 and is stored in the document storage unit 40. FIG. 4 shows a document describing the read() and close() methods of the Stream class, but the electronic manual also contains many other structured documents describing various classes, methods, and so on. All of these structured documents have almost the same document structure.

When displayed in a Web browser, the HTML document of FIG. 4 appears as shown in FIG. 5. As FIG. 5 makes clear, this document records four unit documents (topics): the title (unit document T1), the description of the Stream class (unit document T2), the description of the read() method (unit document T3), and the description of the close() method (unit document T4). The unit documents are separated by dividing lines 50.

Suppose, for example, that the word “A” appears in unit document T3 and the word “B” appears in unit document T4; that is, the read() method is related only to the word “A” and the close() method only to the word “B”. A conventional index creation method then has the following problem. If a user looking for a method related to both “A” and “B” runs an AND search for “A” and “B”, the document of FIG. 4 is returned as a search result, because it contains both words. In reality, however, the read() method is unrelated to the word “B” and the close() method is unrelated to the word “A”. The document of FIG. 4 is therefore not the document the user intended.

In FIG. 4 the description of each method is omitted, so the document is short and can be viewed in its entirety at once in a Web browser. In an actual electronic manual, however, the descriptions are fairly long and the document is large. Consequently, even if a search for the word “B” returns this document, the user cannot immediately find where the word “B” appears. The user must then search the returned document for the position of the word, which is cumbersome.

In the present embodiment, therefore, the document is divided into unit documents as described below. FIG. 6 is a flowchart of the document division processing.

To divide a document, the document reading unit 30 first retrieves a stored structured document and reads a character string (S10). Next, the document structure specifying unit 32 determines whether the character string contains a tag (S12). The presence of a tag can be determined from the presence of “<” and “>”.

If a tag is found, it is stored in memory (S14). The tags are kept in a stack: if tags are already stored in memory, the new tag is added on top of them. The document structure at the appearance position of the tag just read is then specified from the stack of tags stored in memory (S16). The specified document structure is expressed as a path expression, which is a sequence of tags listing, in order, the tags that define the document structure at that point.

For example, when the first line has been read, only the tag “HTML” is on the stack in memory, so the document structure at the first line is expressed by the path expression “HTML”. When the second line has been read, the tags “HTML” and “BODY” are on the stack, so the document structure at the second line can be expressed by the path expression “HTML/BODY”.
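Steps S12 to S16 can be sketched with the standard Python `html.parser` module. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `PathTracker`, the function `tag_paths`, the `VOID_TAGS` set, and the sample HTML are hypothetical. Tags such as `<HR>` that have no closing tag are reported in the path expression but not left on the stack, which is one possible way to obtain “HTML/BODY/HR” for a dividing line.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they appear in a path but are never stacked.
VOID_TAGS = {"HR", "BR", "IMG", "META", "LINK", "INPUT"}

class PathTracker(HTMLParser):
    """Keeps a stack of open tags and records the path expression at each tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # path expression at every start tag, in order

    def handle_starttag(self, tag, attrs):
        name = tag.upper()  # HTMLParser reports tag names in lower case
        self.paths.append("/".join(self.stack + [name]))
        if name not in VOID_TAGS:
            self.stack.append(name)

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.stack:
            # Pop back to, and including, the matching open tag.
            while self.stack.pop() != name:
                pass

def tag_paths(html):
    """Return the path expression at each start tag of the given HTML text."""
    tracker = PathTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.paths

# Hypothetical sample document, loosely modeled on the example in the text.
SAMPLE = "<HTML><BODY><H1>Title</H1><B>bold</B><HR></BODY></HTML>"
PATHS = tag_paths(SAMPLE)
```

For this sample, the second entry of `PATHS` is “HTML/BODY” and the entry for `<B>` is “HTML/BODY/B”, mirroring the path expressions given in the description.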

Once the document structure has been specified as a path expression, it is determined whether this path expression matches a path expression recorded in the path expression table (S18). As shown in FIGS. 7 and 8, a path expression table records path expressions 52 indicating predetermined document structures in association with the processes 54 to be performed at those document structures. The table records at least path expressions P1 and P2, which indicate the document structures at the start points of unit documents, and path expression P3, which indicates the document structure at the end point of each unit document, each in association with its corresponding process.

For example, in the document shown in FIG. 4, unit document T1 starts at the second line, that is, at the document structure indicated by “HTML/BODY”. Unit document T2 starts at the sixth line, that is, at the document structure indicated by “HTML/BODY/DL/DT/B”. In the path expression table, the path expressions P1 and P2 indicating these start points are associated with the process “temporarily store subsequent character strings”.

Each unit document ends at “HTML/BODY/HR”, which indicates a dividing line (the fifth line, eighth line, and so on in FIG. 4). In the path expression table, the path expression P3 “HTML/BODY/HR” indicating the end point of each unit document is associated with the process “output the temporarily stored character strings as separate document data”.

The coincidence determination unit 34 determines whether the path expression specified by the document structure specifying unit 32 matches a path expression recorded in this path expression table (S18) and outputs the result to the process execution unit 36.

If the path expressions match, the process execution unit 36 executes the process associated with that path expression in the path expression table (S20). For example, when the second line of the document of FIG. 4 is read, its document structure is expressed by the path expression “HTML/BODY”. This matches path expression P2 recorded in the path expression table, so the process associated with P2, “temporarily store subsequent character strings”, is executed. That is, the character strings read from then on are stored in the temporary storage unit 42. In the example of FIG. 4, the character strings read from the third line onward, “<A HREF="#read()">read()</A>...”, are temporarily stored in the temporary storage unit 42. Here, all of the character strings read, including tags, are temporarily stored, but only the text portions without tags (“read()”, “close()”) may be stored instead.

Next, it is determined whether the text just read is the end of the structured document (S22). If it is not the end and more text follows, reading of character strings continues, and the tag detection, path expression matching, and process execution described above are repeated. Whenever the specified path expression matches a path expression recorded in the path expression table, the corresponding process is executed based on the table.

For example, when the fifth line is read, the document structure at the fifth line is expressed by the path expression “HTML/BODY/HR”, which matches path expression P3 recorded in the path expression table. Because P3 is associated with the process “output the temporarily stored character strings as separate document data”, the process execution unit 36 outputs the character strings held in the temporary storage unit 42 to the divided document storage unit 44 as a divided document, that is, as new, separate document data. The document ID, address, and so on of the original document data (the document data of FIG. 4) are also recorded in the divided document that is output.

This processing is performed on the entire text of the structured document, and the document division processing ends when the end of the document is reached (S22).

As described above, in the path expression table the path expressions P1 and P2 indicating the start point of each unit document are associated with the process “temporarily store subsequent character strings”, and the path expression P3 indicating the end point of each unit document is associated with the process “output the temporarily stored character strings as separate document data”. Processing the structured document according to this path expression table outputs each unit document as separate document data (a divided document).
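The division procedure summarized above can be sketched as follows. This is a hedged illustration rather than the embodiment itself: FIG. 4 and FIG. 7 are not reproduced here, so the sample HTML, the contents of `PATH_TABLE`, the process names `"store"` and `"output"`, and the names `DocumentSplitter` and `split_document` are all assumptions. The sketch stores only the text portions (the variant the description allows) and records the original document ID in each divided document.

```python
from html.parser import HTMLParser

VOID_TAGS = {"HR", "BR", "IMG"}  # tags without a closing tag

# Hypothetical path expression table: path expression -> process.
PATH_TABLE = {
    "HTML/BODY": "store",           # start point of a unit document
    "HTML/BODY/DL/DT/B": "store",   # start point of a unit document
    "HTML/BODY/HR": "output",       # end point of a unit document
}

class DocumentSplitter(HTMLParser):
    def __init__(self, source_id, table):
        super().__init__()
        self.source_id = source_id
        self.table = table
        self.stack = []      # open tags, outermost first
        self.buffer = None   # temporary storage; None means "not storing"
        self.divided = []    # divided documents output so far

    def handle_starttag(self, tag, attrs):
        name = tag.upper()
        path = "/".join(self.stack + [name])
        if name not in VOID_TAGS:
            self.stack.append(name)
        process = self.table.get(path)
        if process == "store":
            self.buffer = []  # start storing subsequent text
        elif process == "output" and self.buffer is not None:
            text = " ".join(" ".join(self.buffer).split())
            self.divided.append({"source": self.source_id, "text": text})
            self.buffer = None

    def handle_endtag(self, tag):
        name = tag.upper()
        if name in self.stack:
            while self.stack.pop() != name:
                pass

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

def split_document(source_id, html, table=PATH_TABLE):
    """Return the divided documents found in one structured document."""
    splitter = DocumentSplitter(source_id, table)
    splitter.feed(html)
    splitter.close()
    return splitter.divided

# Hypothetical sample, loosely modeled on the manual page in the text.
SAMPLE = (
    '<HTML><BODY>\n'
    '<A HREF="#read()">read()</A> <A HREF="#close()">close()</A>\n'
    '<HR>\n'
    '<DL><DT><B>Stream class</B></DT><DD>Reads and writes bytes.</DD></DL>\n'
    '<HR>\n'
    '</BODY></HTML>'
)
DIVIDED = split_document("sample.html", SAMPLE)
```

In this sketch, text that is still in temporary storage when the document ends without a matching end point is discarded; whether such a remainder should be output is a design choice the description does not settle.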

Being able to output a structured document as a divided document for each unit document has the following advantage. If the index is created from these divided documents, false hits in AND searches can be reduced. Specifically, an index is created that associates the words contained in each divided document with the document information (document ID and so on) of that divided document and with the original document information (original document ID, original document address, and so on) of the structured document from which it was divided. In this index, words contained in different unit documents are associated with different document IDs even when they belong to the same structured document. Searching on the basis of this index therefore prevents false hits in AND searches. In this case, the original document (the structured document) is output as the actual search result.
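An index of this kind can be sketched in Python as follows. The divided document IDs (`sample.html#1`, `sample.html#2`), the sample text, and the function names `build_split_index` and `and_search` are hypothetical and chosen only for illustration; the description leaves the index creation technique itself to conventional methods.

```python
from collections import defaultdict

def build_split_index(divided_docs):
    """Index words by divided document ID and remember each original document."""
    index = defaultdict(set)  # word -> IDs of divided documents containing it
    origin = {}               # divided document ID -> original document ID
    for doc in divided_docs:
        origin[doc["id"]] = doc["source"]
        for word in doc["text"].lower().split():
            index[word].add(doc["id"])
    return index, origin

def and_search(index, origin, *terms):
    """Return original documents in which one unit document has every term."""
    if not terms:
        return []
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return sorted({origin[hit] for hit in hits})

# Hypothetical divided documents taken from one structured document.
divided = [
    {"id": "sample.html#1", "source": "sample.html",
     "text": "read() fills the buffer"},
    {"id": "sample.html#2", "source": "sample.html",
     "text": "close() releases the handle"},
]
index, origin = build_split_index(divided)

same_unit = and_search(index, origin, "read()", "buffer")
different_units = and_search(index, origin, "buffer", "handle")
```

Here `same_unit` names the original document because both terms occur in one unit document, while `different_units` is empty because the terms occur only in different unit documents.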

In addition to the path expressions P1, P2, and P3 indicating the start and end points of unit documents, the path expression tables used in the present embodiment record a path expression P6 indicating an attribute (see FIG. 8), a path expression P7 indicating the appearance position of a unit document (see FIG. 8), a path expression P4 indicating a change in document organization (see FIG. 7), and so on. These are described below.

The organization of the structured document shown in FIG. 4 changes greatly from the ninth line onward. Up to the eighth line the document is organized to describe the class, whereas from the ninth line it is organized to describe the methods.

In such a case, it is better to use a different path expression table from the ninth line onward rather than the same table used up to the eighth line. Therefore, a path expression P4 indicating the document structure at the point where the document organization changes greatly is recorded in the path expression table and associated with the process “change path expression table”. A second path expression table (FIG. 8) suited to the document organization from the ninth line onward is prepared, and processing from the ninth line onward is performed based on this second table.
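Switching between path expression tables can be sketched as a lookup whose active table changes when a “change path expression table” entry matches. The table contents below are hypothetical, since FIGS. 7 and 8 are not reproduced here; in particular “HTML/BODY/H2” is only a stand-in for path expression P4, and the process names and the function `run_with_table_switch` are assumptions for this example.

```python
def run_with_table_switch(paths, tables, first):
    """Apply path expression tables to a sequence of path expressions.

    Returns (path, process, table in use afterwards) for every match.
    """
    current = first
    log = []
    for path in paths:
        entry = tables[current].get(path)
        if entry is None:
            continue  # no path expression in the current table matches
        process, *args = entry
        if process == "change_table":
            current = args[0]  # later paths are checked against the new table
        log.append((path, process, current))
    return log

# Hypothetical tables: the first suits the class description, the second
# suits the method descriptions.
TABLES = {
    "first": {
        "HTML/BODY": ("store",),
        "HTML/BODY/HR": ("output",),
        "HTML/BODY/H2": ("change_table", "second"),  # stand-in for P4
    },
    "second": {
        "HTML/BODY/A": ("store_position",),
        "HTML/BODY/HR": ("output",),
    },
}

PATHS = [
    "HTML/BODY", "HTML/BODY/A", "HTML/BODY/HR",
    "HTML/BODY/H2", "HTML/BODY/A", "HTML/BODY/HR",
]
LOG = run_with_table_switch(PATHS, TABLES, "first")
```

The same path expression “HTML/BODY/A” is ignored while the first table is active and triggers a process once the second table is in use, which is the effect of changing tables where the document organization changes.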

Each unit document may also contain a word that serves as an attribute of that unit document or a word that indicates its appearance position. Path expressions indicating where such words appear may be recorded in the path expression table so that the attribute and appearance position can be temporarily stored.

For example, in the structured document shown in FIG. 4, the word "read()" on the 15th line indicates the appearance position of the unit document T3. As is well known, the tag <A NAME="(keyword)"/> sets a keyword that serves as a link target for that part of the structured document. When a link is written as <A HREF="(URL of the document)#keyword">, following the link moves to the position in the document where the keyword appears. Thus, for example, if the URL of the structured document in FIG. 4 is "sample.html", setting a link as <A HREF="sample.html#read()"> makes it possible to move to the 15th line of this structured document, that is, to the head of the unit document T3.

Accordingly, a path expression P7 indicating the appearance position of a unit document is recorded in the second path expression table and associated with a process that temporarily stores the character string following the path expression P7 as the appearance position of the unit document. As a result, "read()" is temporarily stored as the appearance position of the unit document T3. When the document structure portion corresponding to the path expression P9, which indicates the end point of the unit document, is read, the temporarily stored content, including the word indicating the appearance position, is output as a divided document.
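A minimal sketch of this flow, using Python's standard `html.parser`, might look like the following. The three path expressions, the class `UnitSplitter`, and the function `split_units` are assumptions made for illustration; they stand in for the start point, P7, and P9 entries of the second path expression table, whose real contents are not reproduced here. The sketch also assumes self-closing `<a .../>` and `<hr/>` tags so that the tag stack stays balanced.

```python
from html.parser import HTMLParser

# Hypothetical path expressions (tag sequences from the root).
START_PATH = ("html", "body", "h3")    # start point of a unit document
POSITION_PATH = ("html", "body", "a")  # stands in for P7 (appearance position)
END_PATH = ("html", "body", "hr")      # stands in for P9 (end point)


class UnitSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags = document structure
        self.position = None     # temporarily stored appearance position
        self.text = []           # temporarily stored body text
        self.collecting = False  # True between a start point and an end point
        self.divided = []        # divided documents output so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        path = tuple(self.stack)
        if path == POSITION_PATH:
            self.position = dict(attrs).get("name")
        elif path == START_PATH:
            self.collecting = True
        elif path == END_PATH and self.collecting:
            # Output the temporarily stored content as one divided document.
            self.divided.append(
                {"position": self.position, "text": " ".join(self.text)}
            )
            self.position, self.text, self.collecting = None, [], False

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.collecting and data.strip():
            self.text.append(data.strip())


def split_units(html):
    parser = UnitSplitter()
    parser.feed(html)
    parser.close()
    return parser.divided
```

Calling `split_units` on markup in which each method description begins with an anchor such as `<a name="read()"/>` and ends with `<hr/>` yields one dictionary per unit document, each carrying the anchor name as its appearance position.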

When the index is created, the word indicating the appearance position is stored in association with the words contained in each divided document. At search time, if an appearance position is associated with the matching search term, the search result is displayed already moved to that appearance position. Thus, even when a long document is searched, the area around the search term can be displayed as the search result. The user therefore does not need to search the result again for where the search term appears, and can easily obtain the desired information.
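One way to picture this association is an inverted index whose entries carry the appearance position alongside the document URL. The sketch below is illustrative only; the dictionary keys `url`, `position`, and `text`, and the functions `build_index` and `search`, are hypothetical and not taken from the embodiment.

```python
from collections import defaultdict


def build_index(divided_docs):
    """Map each word to the (url, appearance position) pairs where it occurs."""
    index = defaultdict(set)
    for doc in divided_docs:
        for word in doc["text"].lower().split():
            index[word].add((doc["url"], doc["position"]))
    return index


def search(index, word):
    """Return link targets; a stored position becomes a URL fragment."""
    hits = sorted(index.get(word.lower(), ()), key=lambda h: (h[0], h[1] or ""))
    return [f"{url}#{pos}" if pos else url for url, pos in hits]
```

With a divided document stored as `{"url": "sample.html", "position": "read()", "text": "..."}`, a hit resolves to `sample.html#read()`, so the result opens at the head of the matching unit document rather than at the top of the whole file.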

Each unit document may also contain a word indicating an attribute of the unit document. In the structured document shown in FIG. 4, a link to "read.htm" is set within the unit document T3 (19th line). This link destination can be set as an attribute of the unit document T3. In this case as well, the path expression P6 indicating this attribute is recorded in the second path expression table in association with the process "temporarily store the character string as an attribute". As a result, "read.htm" can be extracted as an attribute of the unit document T3. This attribute is also associated and stored in the index.

Consequently, a search can be performed not only by search term but also by attribute (in the present embodiment, the link destination). This enables more versatile searches. Although the link destination is extracted as the attribute in the present embodiment, other content may of course be extracted as an attribute.
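As a rough illustration of searching by attribute as well as by word, the sketch below filters divided documents on either criterion. The record layout (`id`, `text`, `attributes`) and the function `find` are assumptions introduced for this example; the embodiment does not specify how the index stores attributes.

```python
def find(divided_docs, word=None, link=None):
    """Return ids of divided documents matching a word, a link attribute, or both."""
    results = []
    for doc in divided_docs:
        if word is not None and word.lower() not in doc["text"].lower().split():
            continue  # search term not present in this divided document
        if link is not None and link not in doc["attributes"]:
            continue  # link destination not among the extracted attributes
        results.append(doc["id"])
    return results
```

For instance, `find(docs, link="read.htm")` would return every divided document that links to "read.htm", regardless of the words it contains.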

As described above, in the present embodiment a structured document can be divided into its unit documents. By creating an index based on the resulting divided documents, search results suited to a structured document containing a plurality of unit documents can be provided; that is, document retrieval with higher precision can be performed. In addition, by storing the appearance position and attributes of each unit document in the divided documents and reflecting them in the index, document retrieval with better usability can be performed.

Although the present embodiment has been described using an electronic manual for a programming language as the target of document division, any other structured document may be targeted as long as it contains a plurality of unit documents. The path expression table may also be changed as appropriate according to the organization of the document to be divided. For example, the present embodiment targets a structured document in which unit documents are separated by horizontal rules (<HR/>), but a tag indicating another end point, for example a line break (<BR>), may be set as the end point path expression.

FIG. 1 is a block diagram showing the hardware configuration of an index creation apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the index creation apparatus.
FIG. 3 is a diagram showing an example of a structured document.
FIG. 4 is a diagram showing an example of a structured document to be divided.
FIG. 5 is a diagram showing how the structured document of FIG. 4 is displayed.
FIG. 6 is a flowchart showing the document division process.
FIG. 7 is a diagram showing an example of a path expression table.
FIG. 8 is a diagram showing another example of a path expression table.

Explanation of symbols

10 index creation apparatus, 30 document reading part, 32 document structure specifying part, 34 match determination part, 36 process execution part, 38 index creation part, 40 document storage part, 42 temporary storage part, 44 divided document storage part, 46 path expression table.

Claims

A document dividing apparatus for dividing a structured document that includes a plurality of unit documents and that contains tags defining a document structure and text forming the body of the document, the apparatus comprising:
a path expression table storing a start point path expression, which represents as a sequence of tags the document structure indicating the start point of a unit document, and an end point path expression, which represents as a sequence of tags the document structure indicating the end point of a unit document;
document structure specifying means for reading the structured document, storing the tags that appear, and specifying the document structure at each tag appearance position;
match determination means for determining whether the specified document structure matches the document structure indicated by a path expression stored in the path expression table; and
process execution means for executing a process of newly outputting, as a separate document, the portion from the tag appearance position that matches the document structure indicated by the start point path expression to the tag appearance position that matches the document structure indicated by the end point path expression.

The document dividing apparatus according to claim 1,
wherein the path expression table further stores an attribute path expression that represents as a sequence of tags the document structure indicating an attribute of a unit document, and
wherein, when outputting a new document, the process execution means outputs the text following the tag appearance position that matches the document structure indicated by the attribute path expression as an attribute of the new document.

The document dividing apparatus according to claim 1 or 2,
wherein the path expression table further stores an appearance position path expression that represents as a sequence of tags the document structure indicating the appearance position of a unit document, and
wherein, when outputting a new document, the process execution means outputs the text following the tag appearance position that matches the document structure indicated by the appearance position path expression as the appearance position of the new document.

A computer program for causing a computer to function as the document dividing apparatus according to any one of claims 1 to 3.

A document dividing method for dividing a structured document that includes a plurality of unit documents and that contains tags defining a document structure and text forming the body of the document, the method comprising:
a path expression table creation step in which path expression creation means creates a path expression table storing a start point path expression, which represents as a sequence of tags the document structure indicating the start point of a unit document, and an end point path expression, which represents as a sequence of tags the document structure indicating the end point of a unit document;
a document structure specifying step in which document structure specifying means reads the structured document, stores the tags that appear, and specifies the document structure at each tag appearance position;
a match determination step in which match determination means determines whether the specified document structure matches a path expression stored in the path expression table; and
a process execution step in which process execution means executes a process of newly outputting, as a separate document, the portion from the tag appearance position where the document structure matches the start point path expression to the tag appearance position where the document structure matches the end point path expression.

An index creation apparatus that creates an index in which words included in each document output by the document dividing apparatus according to any one of claims 1 to 3 and identification information of the document are stored in association with each other.