JP2009129013A

JP2009129013A - Method, device, and program for retrieving document

Info

Publication number: JP2009129013A
Application number: JP2007300750A
Authority: JP
Inventors: Hiroki Tanioka; 広樹谷岡
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2007-11-20
Filing date: 2007-11-20
Publication date: 2009-06-11

Abstract

PROBLEM TO BE SOLVED: To retrieve appropriate elements from a hierarchically structured document set in a short time.

SOLUTION: An acquisition part 201 acquires an XML document. A generation part 202 generates a node list including positional information. An input part 203 receives input of retrieval conditions. A calculation part 204 calculates a score showing the degree of agreement of the retrieval conditions. A specification part 205 refers to the positional information to specify a parent node to which a node belongs. An addition part 206 adds an addition value based on the score of the node to the score of the parent node specified by the specification part 205. Based on the score after the addition by the addition part 206, a selection part 207 selects a node to be output as a retrieval result from the nodes listed in the node list. An output part 208 outputs the node selected by the selection part 207.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a document search method, a document search apparatus, and a document search program for searching a hierarchically structured document set for nodes that match an arbitrary search condition.

Conventionally, dividing a series of documents into partial documents of appropriate units, such as chapters and sections, and organizing them hierarchically is considered to offer many advantages, including diversity of publication, accuracy of information retrieval, and ease of division and combination. An XML document is a typical example of a hierarchically structured document. Because document elements and document text in an XML document are marked up with tags, partial documents can be searched easily (see, for example, Patent Document 1 below).

Japanese Patent Application Laid-Open No. 08-044766

However, the prior art described in Patent Document 1 has the problem that multiple minimum-unit elements matching the search condition are retrieved as separate fragments. For example, when several elements that match the search condition are contained in a common higher-level element, it is preferable for the user that this higher-level element be returned as the search result.

To solve the above problem with the prior art, an object of the present invention is to provide a document search method, a document search apparatus, and a document search program capable of retrieving appropriate elements from a hierarchically structured document set in a short time.

To solve the above problem and achieve the object, a document search method according to the present invention includes: an acquisition step of acquiring a hierarchically structured document set; a generation step of generating, from the document set acquired in the acquisition step, a node list that includes, for each node contained in the document set, position information indicating the position of the parent node to which the node belongs; an input step of receiving input of an arbitrary search condition; a calculation step of calculating, for each node in the node list, a score indicating the degree of match with the search condition; a specifying step of identifying, for each node in the node list, the parent node to which the node belongs from among the nodes in the node list by referring to the position information; an addition step of adding, for each node in the node list, an added value based on the score of the node to the score of the parent node identified in the specifying step; a selection step of selecting, based on the scores after the addition in the addition step, a node to be output as a search result from among the nodes in the node list; and an output step of outputting the node selected in the selection step.

According to this invention, a parent node of a node with a high degree of match with the search condition, or a parent node that contains many nodes matching the search condition, can be treated as a node with a high degree of match with the search condition.

In the invention described above, the generation step includes in the node list, as the position information for each node in the node list, the difference between the index of the node and the index of the parent node to which the node belongs.

According to this invention, the position of the parent node can be expressed with a small amount of information.

In the invention described above, the addition step adds the added value based on the score of each node to the score of its parent node in order, starting from the lower-level nodes in the node list.

According to this invention, the addition process needs to be performed only once for each node.

In the invention described above, the calculation step uses the TF-IDF method to calculate, for each node in the node list, the score indicating the degree of match with the search condition.

According to this invention, calculating the score with the TF-IDF method makes it possible to treat as a node with a high degree of match not a node in which a keyword from the search condition merely appears many times, but a node for which that keyword is characteristic.

In the invention described above, the addition step adds, for each node in the node list, the score of the node to the score of the parent node.

According to this invention, a score that reflects the degree of match with the search condition can be added to the parent node.

In the invention described above, the addition step adds, for each node in the node list, an added value based on the score of the node and the position information of the node to the score of the parent node.

According to this invention, a score that takes into account both the degree of match with the search condition and the distance to the parent node can be added to the parent node.

In the invention described above, the addition step adds, for each node in the node list, an added value based on the score of the node and the size of the node to the score of the parent node.

According to this invention, a score that takes into account both the degree of match with the search condition and the size of the node can be added to the parent node.

In the invention described above, the addition step adds, for each node in the node list, an added value based on the score of the node, the position information of the node, and the size of the node to the score of the parent node.

According to this invention, a score that takes into account the degree of match with the search condition, the distance to the parent node, and the size of the node can be added to the parent node.

A document search apparatus according to the present invention includes: acquisition means for acquiring a hierarchically structured document set; generation means for generating, from the document set acquired by the acquisition means, a node list that includes, for each node contained in the document set, position information indicating the position of the parent node to which the node belongs; input means for receiving input of an arbitrary search condition; calculation means for calculating, for each node in the node list, a score indicating the degree of match with the search condition; specifying means for identifying, for each node in the node list, the parent node to which the node belongs from among the nodes in the node list by referring to the position information; addition means for adding, for each node in the node list, an added value based on the score of the node to the score of the parent node identified by the specifying means; selection means for selecting, based on the scores after the addition by the addition means, a node to be output as a search result from among the nodes in the node list; and output means for outputting the node selected by the selection means.

A document search program according to the present invention causes a computer to execute: an acquisition step of acquiring a hierarchically structured document set; a generation step of generating, from the document set acquired in the acquisition step, a node list that includes, for each node contained in the document set, position information indicating the position of the parent node to which the node belongs; an input step of receiving input of an arbitrary search condition; a calculation step of calculating, for each node in the node list, a score indicating the degree of match with the search condition; a specifying step of identifying, for each node in the node list, the parent node to which the node belongs from among the nodes in the node list by referring to the position information; an addition step of adding, for each node in the node list, an added value based on the score of the node to the score of the parent node identified in the specifying step; a selection step of selecting, based on the scores after the addition in the addition step, a node to be output as a search result from among the nodes in the node list; and an output step of outputting the node selected in the selection step.

According to this invention, a computer can be made to treat a parent node of a node with a high degree of match with the search condition, or a parent node that contains many nodes matching the search condition, as a node with a high degree of match with the search condition.

The document search method, document search apparatus, and document search program according to the present invention have the effect that appropriate elements can be retrieved from a hierarchically structured document set in a short time.

Preferred embodiments of the document search method, document search apparatus, and document search program according to the present invention are described in detail below with reference to the accompanying drawings, using an XML document as an example of a hierarchically structured document set.

(Hardware configuration of document search apparatus 100)
First, the hardware configuration of the document search apparatus according to this embodiment is described. FIG. 1 is a block diagram showing an example of the hardware configuration of the document search apparatus according to this embodiment.

In FIG. 1, the document search apparatus 100 includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an HDD (Hard Disc Drive) 104, an HD (Hard Disc) 105, an FDD (Flexible Disc Drive) 106, an FD (Flexible Disc) 107, a CD-RW (Compact Disc ReWritable) drive 108, a CD-RW 109, a display 110, a keyboard 111, a mouse 112, a network I/F (interface) 113, a communication cable 114, and a bus 120.

The CPU 101 controls the entire document search apparatus 100. The ROM 102 stores various control programs and the like. The RAM 103 stores variable data in a rewritable manner and functions as a work area for the CPU 101. The HDD 104 controls reading and writing of data on the HD 105 under the control of the CPU 101. The HD 105 stores data written under the control of the HDD 104.

The FDD 106 controls reading and writing of data on the FD 107 under the control of the CPU 101. The FD 107 is removable and stores data written under the control of the FDD 106. The CD-RW drive 108 controls reading and writing of data on the CD-RW (or CD-R, CD-ROM) 109 under the control of the CPU 101. The CD-RW 109 is removable and stores data written under the control of the CD-RW drive 108.

The display 110 displays a cursor, menus, windows, and various data such as characters and images. The keyboard 111 has a plurality of keys for entering characters, numerical values, various instructions, and the like. The mouse 112 is used to select and execute various instructions, select a processing target, move the mouse pointer, and so on. The network I/F 113 is connected to a network such as a LAN, a WAN, or the Internet via the communication cable 114 and functions as an interface between the network and the CPU 101. The bus 120 connects the above components.

(Functional configuration of document search apparatus 100)
Next, the functional configuration of the document search apparatus 100 according to this embodiment is described. FIG. 2 is a block diagram showing the functional configuration of the document search apparatus 100 according to this embodiment.

As shown in FIG. 2, the document search apparatus 100 includes an acquisition unit 201, a generation unit 202, an input unit 203, a calculation unit 204, a specifying unit 205, an addition unit 206, a selection unit 207, and an output unit 208.

The acquisition unit 201 acquires an XML document. For example, the acquisition unit 201 acquires an XML document by reading an XML document file designated by the user. The XML document file need not be stored inside the document search apparatus 100; it may, for example, be stored in another apparatus connected to the document search apparatus 100. The acquisition unit 201 may also acquire a plurality of XML documents, for example from a predetermined storage location or from a storage location designated by the user. Specifically, the function of the acquisition unit 201 is realized by, for example, the network I/F 113 shown in FIG. 1.

The generation unit 202 generates, from the XML document acquired by the acquisition unit 201, a node list that includes, for each node contained in the document set, position information indicating the position of the parent node to which the node belongs. The node list here is a list of all element nodes present in the XML document, built from the XML document modeled as a tree structure. For each element node, the node list contains information such as an index and a path. When a text node belongs to an element node, information about that text node, such as its index and text, is associated with the element node.

For each node in the node list, the generation unit 202 may include in the node list position information indicating the position of the parent node relative to the node. For example, the generation unit 202 may include, as the position information for each node, the difference between the index of the node and the index of the parent node to which the node belongs. Specifically, when the parent node of the node with index "1006" has index "1002", the generation unit 202 sets the position information of the node with index "1006" to "4". The generation unit 202 temporarily stores the generated node list in memory.

The generation unit 202 need not generate a node list covering all nodes in the XML document; it may instead generate a node list for the nodes within a predetermined range or within a range designated by the user. The generation unit 202 may also include in the node list, for each node, position information indicating the absolute position of the parent node to which the node belongs (for example, the index of the parent node). The specific procedure for generating the node list is described later with reference to FIG. 4. Specifically, the function of the generation unit 202 is realized by, for example, the CPU 101 executing a program stored in the ROM 102, the RAM 103, the HD 105, or the FD 107 shown in FIG. 1.

The input unit 203 receives input of an arbitrary search condition. Here, a search condition is an arbitrary natural-language sentence (a search query sentence), arbitrary data, or the like. For example, a search condition is entered as "Company J I Taro" or "(Company J I Taro)": the former means that both "Company J" and "I Taro" are included, and the latter means that either "Company J" or "I Taro" is included. The search condition need not be entered directly into the document search apparatus 100 by the user; it may, for example, be transmitted from another apparatus connected to the document search apparatus 100. Specifically, the function of the input unit 203 is realized by, for example, the keyboard 111, the mouse 112, and the network I/F 113 shown in FIG. 1.

The calculation unit 204 calculates, for each node in the node list, a score indicating the degree of match with the search condition input via the input unit 203. For example, the score (TF-IDF) can be obtained with calculation formula (1) below.

TFIDF = TF × log(N / DF)   ... (1)

In calculation formula (1), TF is the number of occurrences of the search string in a text node, N is the total number of text nodes, and DF is the number of text nodes that contain the search string.
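Formula (1) can be illustrated with a minimal Python sketch. The function name `tf_idf`, the sample texts, the use of simple substring counting for TF, and the natural logarithm are all assumptions made for illustration; the source does not specify the base of the logarithm or how occurrences are counted.

```python
import math

def tf_idf(text_nodes: list[str], term: str) -> list[float]:
    """Score each text node with formula (1): TF * log(N / DF)."""
    n = len(text_nodes)                               # N: total number of text nodes
    tfs = [text.count(term) for text in text_nodes]   # TF: occurrences per text node
    df = sum(1 for tf in tfs if tf > 0)               # DF: text nodes containing the term
    if df == 0:
        return [0.0] * n                              # term appears nowhere
    idf = math.log(n / df)                            # natural log is an assumption
    return [tf * idf for tf in tfs]

# Hypothetical text nodes, not taken from FIG. 3.
texts = ["XML, scheme", "XML search", "ranking", "index"]
scores = tf_idf(texts, "XML")
```

With these sample texts, N is 4 and DF is 2, so every node that contains the term once receives log(2), and the others receive 0.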

In this embodiment the score is calculated with the TF-IDF method, but the score may also be calculated with another method. Specifically, the function of the calculation unit 204 is realized by, for example, the CPU 101 executing a program stored in the ROM 102, the RAM 103, the HD 105, or the FD 107 shown in FIG. 1.

For each node in the node list, the specifying unit 205 refers to the position information and identifies, from among the nodes in the node list, the parent node to which the node belongs. For example, when the position information is the difference between the index of the node and the index of the parent node, the index of the parent node is determined from the index of the node and the position information. If the position information of the node with index "1006" is "4", the specifying unit 205 identifies the node with index "1002" as the parent node of the node with index "1006". Specifically, the function of the specifying unit 205 is realized by, for example, the CPU 101 executing a program stored in the ROM 102, the RAM 103, the HD 105, or the FD 107 shown in FIG. 1.
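The lookup performed by the specifying unit can be sketched as follows. This is an illustrative sketch only: the function name `find_parent` and the dictionary layout are hypothetical, the sample offsets are invented apart from the 1006 to 1002 case given in the text, and the root is assumed to carry the position information 0.

```python
from typing import Optional

def find_parent(node_list: dict[int, int], index: int) -> Optional[int]:
    """Return the parent index recovered from the stored offset, or None for the root."""
    offset = node_list[index]
    if offset == 0:            # assumed convention: the root stores 0 (no parent)
        return None
    return index - offset      # parent index = node index - difference value

# Hypothetical node list mapping index -> position information (offset to parent).
node_list = {1001: 0, 1002: 1, 1003: 1, 1004: 1, 1005: 2, 1006: 4}
parent_of_1006 = find_parent(node_list, 1006)
```

Because the parent is recovered by a single subtraction, no tree traversal is needed at search time.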

For each node in the node list, the addition unit 206 adds an added value based on the score of the node to the score of the parent node identified by the specifying unit 205. The addition unit 206 performs this addition in order from the lower-level nodes in the node list, that is, in reverse index order. In particular, it is preferable that the addition unit 206 start from the tail of the node list (the rightmost end of the XML structure) and add the score or added value toward the parent nodes in turn. This processing can be done in a single pass over the list, and if the list fits in memory it can be performed very quickly.

For example, when the score of the node identified by the path "A/B/D" is "5" and the score of its parent node "A/B" is "5", the addition process by the addition unit 206 makes the score of node "A/B" "10".
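The reverse-order addition can be sketched in Python as below. This is a sketch under stated assumptions, not the patented implementation: the function name `propagate_scores` is hypothetical, nodes are assumed to be stored in document order so that every parent precedes its children, and the plain score is used as the added value.

```python
def propagate_scores(offsets: list[int], scores: list[float]) -> list[float]:
    """Add each node's score to its parent, scanning from the last node to the first."""
    total = list(scores)
    # Position 0 is assumed to be the root, which has no parent to add to.
    for i in range(len(offsets) - 1, 0, -1):
        parent = i - offsets[i]       # offset = node position - parent position
        total[parent] += total[i]     # total[i] already includes its descendants
    return total

# Hypothetical three-node list in document order: A, A/B, A/B/D.
offsets = [0, 1, 1]
scores = [0.0, 5.0, 5.0]
result = propagate_scores(offsets, scores)   # A/B ends up with 10.0
```

Scanning from the tail means each node is visited exactly once, and by the time a node is added to its parent it has already received the contributions of its own children.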

The addition unit 206 may also calculate, for each node in the node list, an added value from the score of the node and the position information of the node, and add the calculated value to the score of the parent node identified by the specifying unit 205. For example, the added value (ADD) may be obtained with calculation formula (2) below.

ADD = SCORE / log(OFFSET + 1.1)   ... (2)

In calculation formula (2), SCORE is the score of the node and OFFSET is its position information. By using the value given by formula (2) as the added value, a node that is farther from its parent node is treated as less closely related to the parent node, and its contribution to the parent's score is reduced.
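Formula (2) can be tried out with a short Python sketch. The function name `add_value` and the sample scores are hypothetical, and the natural logarithm is an assumption because the source does not state the base.

```python
import math

def add_value(score: float, offset: int) -> float:
    """Formula (2): ADD = SCORE / log(OFFSET + 1.1)."""
    return score / math.log(offset + 1.1)   # natural log is an assumption

near = add_value(5.0, 1)   # child stored directly after its parent
far = add_value(5.0, 4)    # child stored four positions after its parent
```

The constant 1.1 keeps the denominator positive for any non-negative offset. Under the natural-log assumption the divisor is below 1 for an offset of 1, so the nearest child contributes more than its raw score, while larger offsets contribute progressively less.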

The addition unit 206 may calculate the added value with a formula other than the above. For example, the addition unit 206 may calculate the added value from the score of the node and the size of the node, and add the calculated value to the score of the parent node identified by the specifying unit 205. Alternatively, the addition unit 206 may calculate the added value from the score of the node, the size of the node, and the position information of the node, and add the calculated value to the score of the parent node identified by the specifying unit 205.

Specifically, the function of the addition unit 206 is realized by, for example, the CPU 101 executing a program stored in the ROM 102, the RAM 103, the HD 105, or the FD 107 shown in FIG. 1.

The selection unit 207 selects, based on the scores after the addition by the addition unit 206, the nodes to be output as search results from among the nodes in the node list. For example, the selection unit 207 sorts the nodes in the node list by degree of match with the search condition, using the scores after the addition by the addition unit 206. The selection unit 207 then selects a predetermined number of nodes from the sorted node list, starting with the node that has the highest score.
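The sort-and-select step can be sketched as follows. The function name `select_top`, the use of paths as keys, and the sample scores are hypothetical and serve only to illustrate the ranking.

```python
def select_top(scores: dict[str, float], k: int) -> list[str]:
    """Return the k node paths with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical scores after the addition step.
scores = {"/A": 10.0, "/A/B": 12.5, "/A/B/D": 5.0, "/A/C": 0.0}
top = select_top(scores, 2)
```

Here `k` stands for the predetermined number of results; for very large node lists a partial sort such as `heapq.nlargest` would avoid sorting every node.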

The number of nodes selected by the selection unit 207 may be set in advance or may be specified by the user. Specifically, the function of the selection unit 207 is realized by, for example, the CPU 101 executing a program stored in the ROM 102, the RAM 103, the HD 105, or the FD 107 shown in FIG. 1.

The output unit 208 outputs the nodes selected by the selection unit 207. For example, the output unit 208 displays the selected nodes in descending order of degree of match with the search condition. Besides displaying the selected nodes, the output unit 208 may, for example, write them to a file or transmit them to another apparatus connected to the document search apparatus 100. For example, when the search condition was transmitted from another apparatus, the nodes may be transmitted to the apparatus that sent the search condition. Specifically, the function of the output unit 208 is realized by, for example, the display 110 and the network I/F 113 shown in FIG. 1.

(Example of XML document)
Next, an example of an XML document used in the document search apparatus 100 according to the embodiment of the present invention is described. FIG. 3 is an explanatory diagram showing an example of an XML document used in the document search apparatus 100 according to the embodiment of the present invention.

FIG. 3 shows the XML document "c:\documents\0123.xml" modeled as a tree structure. In FIG. 3, "1001" to "1013" denote element nodes, each number being the index of the node. "A" to "E" denote text nodes, each letter being the index of the node.

図３において、たとえば、インデックス「１００４」が付与された要素ノードには、インデックス「Ａ」が付与されたテキストノードが属している。また、インデックス「Ａ」が付与されたテキストノードは、テキスト「ＸＭＬ，ｓｃｈｅｍｅ」を持つ。たとえば、このテキストノードをタグを用いて示した場合、「＜ａｒｔｉｃｌｅ＞＜ｂｏｄｙ＞＜ｓｅｃ＞＜ｐ１＞ＸＭＬ，ｓｃｈｅｍｅ＜／ｐ１＞＜／ｓｅｃ＞＜／ｂｏｄｙ＞＜／ａｒｔｉｃｌｅ＞」と示すことができる。 In FIG. 3, for example, a text node with index “A” belongs to an element node with index “1004”. The text node to which the index “A” is assigned has the text “XML, scheme”. For example, when this text node is indicated by using a tag, it is indicated as “<article> <body> <sec> <p1> XML, scheme </ p1> </ sec> </ body> </ article>”. Can do.

(Node list generation procedure)
Next, the node list generation procedure performed by the generation unit 202 will be described. FIG. 4 is a flowchart showing an example of this procedure.

First, the root element node is selected from the XML document modeled as a tree structure (step S401). For example, in the XML document shown in FIG. 3, the element node with index "1001" is selected.

Next, the element node selected in step S401 is added to the node list (step S402). The information added to the node list includes the index and path of the element node. For example, for the element node with index "1001" in the XML document shown in FIG. 3, the index "1001", the path "/article", and so on are added to the node list.

Next, position information is added to the node list (step S403). The position information can be obtained, for example, by subtracting the index of the parent node from the index of the selected node. For the element node with index "1001" in FIG. 3, which has no parent node, the position information is "0". For the element node with index "1006" in FIG. 3, whose parent node has index "1002", the position information is "4".

Next, it is determined whether a text node belongs to the selected element node (step S404). For example, in the XML document shown in FIG. 3, it is determined that no text node belongs to the element node with index "1001", and that a text node belongs to the element node with index "1004".

If it is determined in step S404 that a text node belongs to the selected element node (step S404: Yes), the selected element node is associated with that text node (step S405), and the process proceeds to step S406.

The information associated with the element node includes the index and text of the text node. For example, the element node with index "1004" in FIG. 3 is associated with the text node index "A", the text "XML, scheme", and so on. If it is determined in step S404 that no text node belongs to the selected element node (step S404: No), step S405 is skipped and the process proceeds to step S406.

Next, it is determined whether all element nodes included in the XML document have been selected (step S406). If all element nodes have been selected (step S406: Yes), the series of processing ends. If not all element nodes have been selected (step S406: No), the next element node is selected from the XML document (step S407), with higher-level nodes given priority. For example, in the XML document shown in FIG. 3, the element nodes are selected in the order of their index values.

The process then returns to step S402, and steps S402 to S407 are repeated until it is determined in step S406 that all nodes have been selected. In this way, all element nodes included in the XML document are added to the node list in order from the higher-level nodes. Every text node included in the XML document is associated with one of the element nodes in the node list, and the position information of the parent node is added to the node list for every element node included in the XML document.
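The generation procedure of steps S401 to S407 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `build_node_list`, the dictionary keys, and the sample fragment `SAMPLE_XML` are hypothetical, nodes are visited in document order (which is consistent with the indices quoted for FIG. 3 but is not stated as such in the text), and only the text directly inside an element is treated as its text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mirrors the paths quoted for FIG. 3;
# the text of <p2> is invented for illustration.
SAMPLE_XML = (
    "<article><body>"
    "<sec><p1>XML, scheme</p1><p2>tag, tag</p2></sec>"
    "<footer/>"
    "</body></article>"
)


def build_node_list(xml_text, first_index=1001):
    """Return one row per element node, parents always before children."""
    rows = []
    text_nodes_seen = 0

    def visit(elem, parent_index, parent_path):
        nonlocal text_nodes_seen
        index = first_index + len(rows)  # S402: index of this element node
        path = f"{parent_path}/{elem.tag}"
        row = {
            "index1": index,
            "path": path,
            "index2": None,
            "text": None,
            # S403: own index minus parent index, 0 for the root
            "parent": 0 if parent_index is None else index - parent_index,
        }
        text = (elem.text or "").strip()
        if text:  # S404/S405: associate the text node with the element
            row["index2"] = chr(ord("A") + text_nodes_seen)
            row["text"] = text
            text_nodes_seen += 1
        rows.append(row)
        for child in elem:  # S406/S407: move on to the next element node
            visit(child, index, path)

    visit(ET.fromstring(xml_text), None, "")
    return rows


for row in build_node_list(SAMPLE_XML):
    print(row)
```

With this fragment, the row for "/article" gets position information 0 and the row for "/article/body/footer" gets 4, matching the two values given in the text.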

(Example of node list)
Next, an example of the node list will be described. FIG. 5 is an explanatory diagram showing an example of a node list.

The node list 500 shown in FIG. 5 is generated from the XML document shown in FIG. 3 by the procedure described above with reference to FIG. 4, and consists of the columns "index1", "path", "index2", "text", and "parent".

The column "index1" holds the index of the element node, and the column "path" holds its path. The column "index2" holds the index of the text node associated with the element node, and the column "text" holds the text of that text node. The column "parent" holds the position information of the parent node to which the element node belongs.

For example, the node list 500 shown in FIG. 5 shows that the element node with index "1004" and path "/article/body/sec/p1" is associated with the text node that has index "A" and contains the text "XML, scheme". It also shows that the position information of the parent node of the element node with index "1006" and path "/article/body/footer" is "4".

Because the node list 500 is sorted in element-node index order, the original XML structure can be reconstructed and the information of a parent node can be accessed easily. In addition, by referring to the node list 500, the information held by each node can be processed at high speed while taking parent-child, ancestor-descendant, and sibling relationships into account.
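As a rough illustration of why the sorted list makes parent access cheap: when the rows are stored in index order, the parent of the row at position i sits at position i minus its "parent" value, so no search is needed. The helper names below are hypothetical, and the offsets for indices 1002 to 1005 are assumed for a small six-node fragment; only the values for 1001 (0) and 1006 (4) come from the text.

```python
# "parent" column for rows with indices 1001..1006 (partly assumed values).
OFFSETS = [0, 1, 1, 1, 2, 4]


def parent_row(offsets, i):
    """Row position of the parent of row i, or None for the root."""
    return None if offsets[i] == 0 else i - offsets[i]


def ancestor_rows(offsets, i):
    """Row positions of all ancestors of row i, nearest first."""
    chain = []
    p = parent_row(offsets, i)
    while p is not None:
        chain.append(p)
        p = parent_row(offsets, p)
    return chain


# Row 5 is index 1006; its parent is row 1, that is, index 1002.
print(parent_row(OFFSETS, 5))
print(ancestor_rows(OFFSETS, 3))
```

Each lookup is a single subtraction, and walking up to the root costs one step per tree level.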

(Procedure of document search processing by the document search apparatus 100)
Next, the procedure of the document search processing performed by the document search apparatus 100 according to the embodiment of the present invention will be described. FIG. 6 is a flowchart showing an example of this procedure.

First, the acquisition unit 201 acquires an XML document (step S601), and the generation unit 202 generates a node list (with N nodes) from the XML document acquired in step S601 (step S602). The specific procedure for generating the node list is as described above with reference to FIG. 4.

Next, the input unit 203 receives input of a search condition (step S603), and the calculation unit 204 calculates, for every node in the node list generated in step S602, a score indicating the degree of match with the search condition input in step S603 (step S604).

Subsequently, the identification unit 205 selects the i-th node from the top of the node list generated in step S602, with i running from N down to 1 (step S605), and identifies the parent node of the node selected in step S605 (step S606).

The addition unit 206 then adds the score of the node selected in step S605 to the score of the parent node identified in step S606 (step S607), and the process proceeds to step S608.

Subsequently, it is determined whether all nodes in the node list generated in step S602 have been selected (step S608). If not all nodes have been selected (step S608: No), steps S605 to S608 are repeated until it is determined in step S608 that all nodes have been selected.

If it is determined in step S608 that all nodes have been selected (step S608: Yes), the selection unit 207 sorts the nodes in the node list by degree of match with the search condition, based on the scores after the addition in step S607 (step S609), and selects a predetermined number of nodes with the highest degree of match from the sorted nodes (step S610).

The output unit 208 then outputs the nodes selected in step S610 (step S611), and the series of processing ends.
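Steps S605 to S610 can be sketched as a single backward pass over the node list followed by a sort. The function names are hypothetical, and the sketch assumes that a node's accumulated total (its own score plus whatever its descendants have already added) is what gets passed to its parent, which follows from processing the rows from the bottom up. The scores 38 and 80 are the ones quoted for indices 1004 and 1005; the zeros and the offsets for 1002 to 1005 are assumptions.

```python
def accumulate_scores(offsets, scores):
    """S605-S608: add each node's total to its parent, last row first."""
    totals = list(scores)
    for i in range(len(totals) - 1, -1, -1):  # i = N ... 1
        if offsets[i]:                         # 0 means "no parent" (root)
            totals[i - offsets[i]] += totals[i]
    return totals


def select_top(totals, count):
    """S609-S610: row positions of the `count` best-scoring nodes."""
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return order[:count]


offsets = [0, 1, 1, 1, 2, 4]    # rows for indices 1001..1006
scores = [0, 0, 0, 38, 80, 0]   # "score1" values before the addition
totals = accumulate_scores(offsets, scores)
print(totals)
print(select_top(totals, 3))
```

Because every row is visited once and each parent is found by subtraction, the pass is linear in the number of nodes. With plain summation and non-negative scores an ancestor never scores lower than its descendants, so the ranking in this sketch favors the upper nodes; the variant described with FIG. 9 adds a reduced addition value instead of the raw score.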

(Example of scores calculated by the calculation unit 204)
Next, an example of the scores calculated by the calculation unit 204 will be described. FIG. 7 is an explanatory diagram showing an example of these scores.

FIG. 7 shows the association between the node list 500 shown in FIG. 5 and a score list 700 that holds the score of each element node calculated by the calculation unit 204. In the score list 700, the column "score1" holds the score calculated by the calculation unit 204. The search string used for this calculation is "XML, tag, scheme".

For example, the node list 500 and the score list 700 show that the element node with index "1004" and path "/article/body/sec/p1" is associated with the score "38" calculated by the calculation unit 204. This score "38" is obtained by the following TF-IDF formula (3).

38 (TF-IDF score) = 1 (TF: number of occurrences of the search string "XML" in the text node) × 20 (IDF: log(total number of text nodes / number of text nodes containing the search string "XML")) + 1 (TF: number of occurrences of the search string "scheme" in the text node) × 18 (IDF: log(total number of text nodes / number of text nodes containing the search string "scheme")) ... (3)

Similarly, the node list 500 and the score list 700 show that the element node with index "1005" and path "/article/body/sec/p2" is associated with the score "80" calculated by the calculation unit 204. This score "80" is obtained by the following TF-IDF formula (4).

80 (TF-IDF score) = 2 (TF: number of occurrences of the search string "tag" in the text node) × 40 (IDF: log(total number of text nodes / number of text nodes containing the search string "tag")) ... (4)
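The arithmetic of formulas (3) and (4) can be reproduced with a short sketch. The IDF weights 20, 18, and 40 are taken as given from the formulas, since the text does not say how the logarithm is scaled. The function name `tfidf_score`, the comma-based tokenization, and the text "tag, tag" for the second node are assumptions made for illustration.

```python
# IDF weights as quoted in formulas (3) and (4); not recomputed here.
IDF = {"XML": 20, "scheme": 18, "tag": 40}
QUERY = ["XML", "tag", "scheme"]


def tfidf_score(text, query_terms, idf):
    """Sum of TF x IDF over the query terms for one text node."""
    tokens = [token.strip() for token in text.split(",")]
    return sum(tokens.count(term) * idf[term] for term in query_terms)


print(tfidf_score("XML, scheme", QUERY, IDF))  # node 1004, formula (3)
print(tfidf_score("tag, tag", QUERY, IDF))     # node 1005, formula (4)
```

A query term that does not occur in a text node contributes zero, which is why formula (3) has no "tag" term and formula (4) has no "XML" or "scheme" term.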

(Example of scores added by the addition unit 206)
Next, an example of the scores added by the addition unit 206 will be described. FIG. 8 is an explanatory diagram showing an example of these scores.

FIG. 8 shows the association among the node list 500 shown in FIG. 5, the score list 700 shown in FIG. 7, and a score list 800 that holds the scores after the addition by the addition unit 206. In the score list 800, the column "score2" holds the score after the addition by the addition unit 206.

For example, the score list 800 shows that the element node with index "1011" and path "/article/body/sec/title/name" is associated with the score "66" after the addition by the addition unit 206. This score "66" is the sum of the scores of the element nodes belonging to this element node: the score "22" of the element node with index "1012" and the score "44" of the element node with index "1013".

(Another example of scores added by the addition unit 206)
Next, another example of the scores added by the addition unit 206 will be described. FIG. 9 is an explanatory diagram showing another example of these scores.

FIG. 9 shows the association among the node list 500 shown in FIG. 5, the score list 700 shown in FIG. 7, and a score list 900 that holds the scores after the addition by the addition unit 206. In the score list 900, the column "score2" holds the score after the addition by the addition unit 206.

For example, the score list 900 shows that the element node with index "1011" and path "/article/body/sec/title/name" is associated with the score "44" after the addition by the addition unit 206. This score "44" is the sum of two addition values, each calculated by formula (2) described with reference to FIG. 2: the addition value "20" calculated from the score "22" of the element node with index "1012", and the addition value "24" calculated from the score "44" of the element node with index "1013", both of which belong to this element node.

(Example of search results output by the output unit 208)
Next, an example of the search results output by the output unit 208 will be described. FIG. 10 is an explanatory diagram showing an example of these search results.

FIG. 10 shows the search results output by the output unit 208 after the document search processing described above with reference to FIG. 6 is performed on the XML document shown in FIG. 3. As shown in FIG. 10, the user has specified the search target document "c:\documents\0123.xml", the search condition "XML, tag, scheme", and the number of results "3" for the document search processing.

When the "Search" button is pressed, the document search processing is performed on the search target document "c:\documents\0123.xml", and the three nodes in that document with the highest degree of match with the search condition are output as the search results.

As described above, the document search apparatus 100 according to the present embodiment adds, for each node in the node list, an addition value based on the score of that node to the score of its parent node, and selects the nodes to be output as search results from the nodes in the node list based on the scores after the addition. Appropriate elements can therefore be retrieved from a hierarchically structured document set.

The document search apparatus 100 according to the present embodiment also adds the addition value based on each node's score to the score of its parent node in order, starting from the lower nodes in the node list, and identifies each parent node by referring to the position information. Appropriate elements can therefore be retrieved from a hierarchically structured document set in a short time.

The document search method, document search apparatus, and document search program according to the present invention can also be applied to documents other than XML documents, as long as the document set is hierarchically structured. They are not limited to documents stored as files and can also be applied, for example, to documents stored in a database. Furthermore, they are not limited to documents stored in a single file or a single database and can also be applied to documents spread across multiple files or multiple databases.

The document search method described in the present embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable storage medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD, and is executed by being read from the storage medium by the computer. The program may also be distributed via a network such as the Internet.

As described above, the document search method, document search apparatus, and document search program according to the present invention are suitable for use in personal computers, document servers, document search software, and the like that search a hierarchically structured document set for nodes matching an arbitrary search condition.

FIG. 1 is a block diagram showing an example of the hardware configuration of the document search apparatus according to the embodiment.
FIG. 2 is a block diagram showing the functional configuration of the document search apparatus according to the embodiment.
FIG. 3 is an explanatory diagram showing an example of an XML document used in the document search apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart showing an example of the node list generation procedure performed by the generation unit.
FIG. 5 is an explanatory diagram showing an example of a node list.
FIG. 6 is a flowchart showing an example of the procedure of document search processing by the document search apparatus according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of scores calculated by the calculation unit.
FIG. 8 is an explanatory diagram showing an example of scores added by the addition unit.
FIG. 9 is an explanatory diagram showing another example of scores added by the addition unit.
FIG. 10 is an explanatory diagram showing an example of search results output by the output unit.

Explanation of symbols

100 Document search apparatus
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 CD-RW drive
109 CD-RW
110 Display
111 Keyboard
112 Mouse
113 Network I/F
114 Communication cable
120 Bus
201 Acquisition unit
202 Generation unit
203 Input unit
204 Calculation unit
205 Identification unit
206 Addition unit
207 Selection unit
208 Output unit
500 Node list
700 Score list
800 Score list
900 Score list

Claims

A document search method comprising:
an acquisition step of acquiring a hierarchically structured document set;
a generation step of generating, from the document set acquired in the acquisition step, a node list that includes, for each node included in the document set, position information indicating the position of the parent node to which the node belongs;
an input step of receiving input of an arbitrary search condition;
a calculation step of calculating, for each node indicated in the node list, a score indicating the degree of match with the search condition;
an identification step of identifying, for each node indicated in the node list and by referring to the position information, the parent node to which the node belongs from among the nodes indicated in the node list;
an addition step of adding, for each node indicated in the node list, an addition value based on the score of the node to the score of the parent node identified in the identification step;
a selection step of selecting, based on the scores after the addition in the addition step, a node to be output as a search result from among the nodes indicated in the node list; and
an output step of outputting the node selected in the selection step.

The document search method, wherein the generation step includes in the node list,
as the position information for each node indicated in the node list, the difference between the index of the node and the index of the parent node to which the node belongs.

The document search method according to claim 1, wherein the addition step
adds the addition value based on the score of each node to the score of its parent node in order, starting from the lower nodes indicated in the node list.

The document search method according to any one of the preceding claims, wherein the calculation step calculates, using a TF-IDF method, a score indicating the degree of match with the search condition for each node indicated in the node list.

The document search method according to claim 1, wherein the addition step adds, for each node indicated in the node list, the score of the node to the score of the parent node.

The document search method according to any one of claims 1 to 4, wherein the addition step adds, for each node indicated in the node list, an addition value based on the score of the node and the position information of the node to the score of the parent node.

The document search method according to any one of the above claims, wherein the addition step adds, for each node indicated in the node list, an addition value based on the score of the node and the size of the node to the score of the parent node.

The document search method according to any one of claims 1 to 4, wherein the addition step adds, for each node indicated in the node list, an addition value based on the score of the node, the position information of the node, and the size of the node to the score of the parent node.

A document search apparatus comprising:
acquisition means for acquiring a hierarchically structured document set;
generation means for generating, from the document set acquired by the acquisition means, a node list that includes, for each node included in the document set, position information indicating the position of the parent node to which the node belongs;
input means for receiving input of an arbitrary search condition;
calculation means for calculating, for each node indicated in the node list, a score indicating the degree of match with the search condition;
identification means for identifying, for each node indicated in the node list and by referring to the position information, the parent node to which the node belongs from among the nodes indicated in the node list;
addition means for adding, for each node indicated in the node list, an addition value based on the score of the node to the score of the parent node identified by the identification means;
selection means for selecting, based on the scores after the addition by the addition means, a node to be output as a search result from among the nodes indicated in the node list; and
output means for outputting the node selected by the selection means.

A document search program causing a computer to execute:
an acquisition step of acquiring a hierarchically structured document set;
a generation step of generating, from the document set acquired in the acquisition step, a node list that includes, for each node included in the document set, position information indicating the position of the parent node to which the node belongs;
an input step of receiving input of an arbitrary search condition;
a calculation step of calculating, for each node indicated in the node list, a score indicating the degree of match with the search condition;
an identification step of identifying, for each node indicated in the node list and by referring to the position information, the parent node to which the node belongs from among the nodes indicated in the node list;
an addition step of adding, for each node indicated in the node list, an addition value based on the score of the node to the score of the parent node identified in the identification step;
a selection step of selecting, based on the scores after the addition in the addition step, a node to be output as a search result from among the nodes indicated in the node list; and
an output step of outputting the node selected in the selection step.