JP2008146209A

JP2008146209A - Document retrieval device, document retrieval method and document retrieval program

Info

Publication number: JP2008146209A
Application number: JP2006330571A
Authority: JP
Inventors: Hiroki Tanioka; 広樹谷岡
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2006-12-07
Filing date: 2006-12-07
Publication date: 2008-06-26

Abstract

PROBLEM TO BE SOLVED: To improve usability and retrieval accuracy in document retrieval processing.

SOLUTION: An acquisition part 201 acquires an XML document. A generation part 202 generates a node list from the XML document. An input part 203 receives input of a retrieval condition. A calculation part 204 calculates, for each node in the node list, a score indicating the degree of match with the retrieval condition. A decision part 205 decides whether each node in the node list satisfies a prescribed matching condition. An addition part 206 adds the score of each node that the decision part 205 has decided satisfies the prescribed matching condition to the score of the parent node to which that node belongs. A determination part 207 determines, from the nodes in the node list, the nodes having a high degree of match with the retrieval condition as the retrieval result, based on the scores added by the addition part 206 and the scores calculated by the calculation part 204.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a document search apparatus, a document search method, and a document search program for searching a hierarchically structured document set for nodes that match a search condition entered as a natural-language sentence.

Conventionally, dividing a series of documents into partial documents of appropriate units, such as chapters and sections, and organizing them hierarchically has been regarded as offering many advantages, including diversity of publication, accuracy of information retrieval, and ease of division and combination. An XML document is a typical example of a hierarchically structured document. In an XML document, document elements and document text are marked up with tags, so the document can be modeled as a tree structure composed of element nodes and text nodes. This makes it possible to search for partial documents in units of nodes.

A common search method used for partial-document search over XML documents is to retrieve from the XML documents every minimum-unit partial document that contains a specified search character string (see, for example, Patent Document 1 below).

Patent Document 1: Japanese Unexamined Patent Application Publication No. H11-306205 (JP-A-11-306205)

However, with the conventional technique described in Patent Document 1, even when several minimum-unit partial documents containing the search character string belong to the same larger partial document, they are retrieved as separate fragments. For example, when a document is divided into chapters, sections, subsections, and sentences, the sentence-level partial documents containing the search character string are retrieved piecemeal. In such a case, it is preferable for the user that a partial document of a unit encompassing those minimum-unit partial documents (for example, a chapter or a section) be returned as the search result.

To address this problem, a method has been devised that retrieves not only all minimum-unit partial documents (for example, sentences) containing the search character string but also all partial documents of the various units (for example, chapters, sections, and subsections) related to the search character string. In that case, however, the retrieved partial documents include many that have little relevance to the search character string and are far from what the user intended.

Furthermore, the conventional technique described in Patent Document 1 does not take the strength of relevance into account during search processing. It therefore cannot retrieve only the partial documents most strongly related to the search character string, nor can it display the retrieved partial documents in an order that reflects the strength of their relevance to the search character string.

Thus, the conventional techniques described above not only fail to retrieve partial documents of an appropriate unit and number but also fail to display the retrieved partial documents in an appropriate order.

To solve the above problems of the conventional techniques, an object of the present invention is to provide a document search apparatus, a document search method, and a document search program that improve search accuracy and usability in document search processing by retrieving partial documents of an appropriate unit and number and displaying the retrieved partial documents in an appropriate order.

To solve the above problems and achieve the object, a document search apparatus according to the present invention searches a hierarchically structured document set for nodes that match a search condition entered as a natural-language sentence, and comprises: acquisition means for acquiring the document set; generation means for generating a node list from the document set acquired by the acquisition means; input means for receiving input of the search condition; calculation means for calculating, for each node indicated in the node list generated by the generation means, a score indicating the degree of match with the search condition, based on the search condition input through the input means; decision means for deciding, for each node indicated in the node list generated by the generation means, whether the node satisfies a predetermined matching condition; addition means for adding the score of each node decided by the decision means to satisfy the predetermined matching condition to the score of the parent node to which that node belongs; and determination means for determining, as search results, nodes having a high degree of match with the search condition from the node list generated by the generation means, based on the scores added by the addition means and the scores calculated by the calculation means.

According to this invention, in document search processing, when a node having a text node satisfies the predetermined matching condition, the parent node to which that node belongs can be treated as a node having a high degree of match with the search condition.

In the above invention, the document search apparatus according to the present invention further comprises output control means for controlling output so that the nodes determined as search results by the determination means are displayed in descending order of degree of match with the search condition.

According to this invention, output can be controlled so that the nodes retrieved from the document set as having a high degree of match with the search condition are displayed in an appropriate order.

In the above invention, the determination means of the document search apparatus according to the present invention sorts the node list generated by the generation means in descending order of degree of match with the search condition, based on the scores added by the addition means and the scores calculated by the calculation means, and determines a predetermined number of nodes from the top of the sorted node list as the search results.

According to this invention, only the required number of nodes having a higher degree of match with the search condition can be retrieved from the document set.

In the above invention, the calculation means of the document search apparatus according to the present invention uses the TF-IDF method to calculate, for each node indicated in the node list generated by the generation means, a score indicating the degree of match with the search condition, based on the search condition input through the input means.

According to this invention, calculating the score with the TF-IDF method in document search processing makes it possible to treat as a node having a high degree of match not a node in which a keyword from the search condition merely appears many times, but a node for which that keyword is characteristic.

A document search method according to the present invention is a method for searching a hierarchically structured document set for nodes that match a search condition entered as a natural-language sentence, and causes a computer to execute: an acquisition step of acquiring the document set; a generation step of generating a node list from the document set acquired in the acquisition step; an input step of receiving input of the search condition; a calculation step of calculating, for each node indicated in the node list generated in the generation step, a score indicating the degree of match with the search condition, based on the search condition input in the input step; a decision step of deciding, for each node indicated in the node list generated in the generation step, whether the node satisfies a predetermined matching condition; an addition step of adding the score of each node decided in the decision step to satisfy the predetermined matching condition to the score of the parent node to which that node belongs; and a determination step of determining, as search results, nodes having a high degree of match with the search condition from the node list generated in the generation step, based on the scores added in the addition step and the scores calculated in the calculation step.

According to this invention, in document search processing, when a node having a text node satisfies the predetermined matching condition, a computer can be made to treat the parent node to which that node belongs as a node having a high degree of match with the search condition.

A document search program according to the present invention is a program for searching a hierarchically structured document set for nodes that match a search condition entered as a natural-language sentence, and causes a computer to execute: an acquisition step of acquiring the document set; a generation step of generating a node list from the document set acquired in the acquisition step; an input step of receiving input of the search condition; a calculation step of calculating, for each node indicated in the node list generated in the generation step, a score indicating the degree of match with the search condition, based on the search condition input in the input step; a decision step of deciding, for each node indicated in the node list generated in the generation step, whether the node satisfies a predetermined matching condition; an addition step of adding the score of each node decided in the decision step to satisfy the predetermined matching condition to the score of the parent node to which that node belongs; and a determination step of determining, as search results, nodes having a high degree of match with the search condition from the node list generated in the generation step, based on the scores added in the addition step and the scores calculated in the calculation step.

The document search apparatus, document search method, and document search program according to the present invention have the effect of improving search accuracy and usability in document search processing by retrieving partial documents of an appropriate unit and number and displaying the retrieved partial documents in an appropriate order.

Preferred embodiments of the document search apparatus, document search method, and document search program according to the present invention are described in detail below with reference to the accompanying drawings, using an XML document as an example of a hierarchically structured document set.

(Hardware configuration of document search apparatus 100)
First, the hardware configuration of the document search apparatus according to this embodiment is described. FIG. 1 is a block diagram showing an example of the hardware configuration of the document search apparatus according to this embodiment.

In FIG. 1, the document search apparatus 100 comprises a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, an HDD (Hard Disc Drive) 104, an HD (Hard Disc) 105, an FDD (Flexible Disc Drive) 106, an FD (Flexible Disc) 107, a CD-RW (Compact Disc ReWritable) drive 108, a CD-RW 109, a display 110, a keyboard 111, a mouse 112, a network I/F (interface) 113, a communication cable 114, and a bus 120.

The CPU 101 controls the entire document search apparatus 100. The ROM 102 stores various control programs and the like. The RAM 103 stores variable data in a rewritable manner and functions as a work area for the CPU 101. The HDD 104 controls reading and writing of data on the HD 105 under the control of the CPU 101. The HD 105 stores data written under the control of the HDD 104.

The FDD 106 controls reading and writing of data on the FD 107 under the control of the CPU 101. The FD 107 is removable and stores data written under the control of the FDD 106. The CD-RW drive 108 controls reading and writing of data on the CD-RW (or CD-R or CD-ROM) 109 under the control of the CPU 101. The CD-RW 109 is removable and stores data written under the control of the CD-RW drive 108.

The display 110 displays a cursor, menus, windows, and various data such as characters and images. The keyboard 111 has a plurality of keys for entering characters, numerical values, various instructions, and the like. The mouse 112 is used to select and execute various instructions, select a processing target, move the mouse pointer, and so on. The network I/F 113 is connected via the communication cable 114 to a network such as a LAN, a WAN, or the Internet, and functions as an interface between that network and the CPU 101. The bus 120 connects the above components.

(Functional configuration of document search apparatus 100)
Next, the functional configuration of the document search apparatus 100 according to this embodiment is described. FIG. 2 is a block diagram showing the functional configuration of the document search apparatus 100 according to this embodiment.

As shown in FIG. 2, the document search apparatus 100 comprises an acquisition unit 201, a generation unit 202, an input unit 203, a calculation unit 204, a decision unit 205, an addition unit 206, a determination unit 207, an output control unit 208, and a display unit 209.

The acquisition unit 201 acquires an XML document. For example, the acquisition unit 201 acquires an XML document by reading an XML document file designated by the user. The XML document file need not be stored inside the document search apparatus 100; it may, for example, be stored in another apparatus connected to the document search apparatus 100. Specifically, the function of the acquisition unit 201 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.

The generation unit 202 generates a node list from the XML document acquired by the acquisition unit 201. The node list is a list of all element nodes present in the XML document, built from the XML document modeled as a tree structure, and contains information such as an index and a path for each element node. When a text node belongs to an element node, information about that text node, such as its index and text, is associated with the element node. The node list generated by the generation unit 202 is temporarily stored in memory, for example as a list giving the path of each node.

Note that the generation unit 202 may generate not only a node list covering all nodes in the XML document but also a node list covering nodes within a predetermined range or within a range designated by the user. The specific procedure for generating the node list is described later with reference to FIG. 4. Specifically, the function of the generation unit 202 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.

The input unit 203 receives input of a search condition (search query sentence). For example, the search condition is entered as "Company-J I-Taro" or "(Company-J I-Taro)". The former means that both "Company-J" and "I-Taro" are included, and the latter means that either "Company-J" or "I-Taro" is included. The search condition need not be entered directly into the document search apparatus 100 by the user; it may, for example, be transmitted from another apparatus connected to the document search apparatus 100. Specifically, the function of the input unit 203 is realized by, for example, the keyboard 111, the mouse 112, and the network I/F 113 shown in FIG. 1.
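The two query forms described above can be illustrated with a minimal sketch. It assumes that search terms are separated by whitespace and that enclosing ASCII parentheses switch the query from "all terms" to "any term"; the source does not specify the exact syntax, and the function names `parse_query` and `matches` are hypothetical.

```python
def parse_query(query: str) -> tuple[str, list[str]]:
    """Return ("OR", terms) for a parenthesized query, else ("AND", terms)."""
    q = query.strip()
    if q.startswith("(") and q.endswith(")"):
        # Assumed convention: parentheses mean "any of the terms".
        return "OR", q[1:-1].split()
    # Assumed convention: a bare list of terms means "all of the terms".
    return "AND", q.split()


def matches(text: str, query: str) -> bool:
    """Check whether a text satisfies the query under the assumed semantics."""
    mode, terms = parse_query(query)
    hits = [term in text for term in terms]
    return any(hits) if mode == "OR" else all(hits)
```

For instance, a text that mentions only "I-Taro" would satisfy "(Company-J I-Taro)" but not "Company-J I-Taro" under these assumptions.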

The calculation unit 204 calculates, for each node indicated in the node list generated by the generation unit 202, a score indicating the degree of match with the search condition, based on the search condition input through the input unit 203. The score (TF-IDF) can be obtained from the following formula (1).

TFIDF = TF × log(N / DF) ... (1)

In formula (1), TF is the number of occurrences of the search character string in the text node, N is the total number of text nodes, and DF is the number of text nodes that contain the search character string.
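Formula (1) can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not give the base of the logarithm (the natural logarithm is assumed here), occurrences are counted by plain substring matching, and the function name `tfidf_scores` is hypothetical.

```python
import math


def tfidf_scores(texts: list[str], term: str) -> list[float]:
    """Score each text node for one search string using formula (1)."""
    n = len(texts)                              # N: total number of text nodes
    df = sum(1 for t in texts if term in t)     # DF: text nodes containing the term
    if df == 0:
        return [0.0] * n                        # term appears nowhere: all scores are 0
    idf = math.log(n / df)                      # log(N / DF), natural log assumed
    # TF: number of occurrences of the term within each text node
    return [t.count(term) * idf for t in texts]
```

With four text nodes of which two contain the term, every occurrence contributes log(4 / 2) to the score of the node in which it appears.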

Although the score is calculated with the TF-IDF method in this embodiment, the score may instead be calculated with another method. Specifically, the function of the calculation unit 204 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.

The decision unit 205 decides, for each node indicated in the node list generated by the generation unit 202, whether the node satisfies a predetermined matching condition, based on the score calculated by the calculation unit 204. Examples of matching conditions include "whether the score is at least a predetermined value, or below a predetermined value", "whether the total of the node's score and the scores of its sibling nodes is at least a predetermined value", and "whether the total number of nodes, counting the node and its sibling nodes, is at most a predetermined value"; other matching conditions may also be used. The matching condition may be set in advance or designated by the user. Specifically, the function of the decision unit 205 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.
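As an illustration only, the three example conditions could be expressed as a single predicate. Combining them with a logical AND, the default threshold values, and the name `satisfies_matching_condition` are all assumptions; the source lists the conditions as alternatives and does not say how, or whether, they are combined.

```python
def satisfies_matching_condition(
    score: float,
    sibling_scores: list[float],
    min_score: float = 1.0,
    min_sibling_total: float = 3.0,
    max_siblings: int = 10,
) -> bool:
    """Apply three of the example conditions together (thresholds are illustrative)."""
    group = [score, *sibling_scores]          # the node together with its siblings
    return (
        score >= min_score                    # the node's own score is high enough
        and sum(group) >= min_sibling_total   # total score of the sibling group
        and len(group) <= max_siblings        # the sibling group is not too large
    )
```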

The addition unit 206 adds the score of each node that the decision unit 205 has decided satisfies the predetermined matching condition to the score of the parent node to which that node belongs. For example, when the node "A/B/D", decided to satisfy the predetermined matching condition, has a score of "5" and its parent node "A/B" has a score of "5", the addition processing by the addition unit 206 raises the score of node "A/B" to "10". The addition unit 206 performs this addition processing on all nodes indicated in the node list generated by the generation unit 202, working in order from the leaf nodes (that is, the nodes deep in the hierarchy) toward the root node. Specifically, the function of the addition unit 206 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.
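The bottom-up addition can be sketched as follows, using the "A/B/D" example above. The sketch assumes that nodes are identified by slash-separated paths, that the matching condition is a simple score threshold, and that the threshold is applied to a node's accumulated score; none of these details is fixed by the source, and the name `propagate_scores` is hypothetical.

```python
def propagate_scores(scores: dict[str, float], threshold: float) -> dict[str, float]:
    """Add each qualifying node's score to its parent, deepest nodes first."""
    total = dict(scores)                      # leave the caller's scores untouched
    # Deeper paths contain more "/" separators, so sort by depth, descending.
    for path in sorted(total, key=lambda p: p.count("/"), reverse=True):
        parent = path.rsplit("/", 1)[0] if "/" in path else None
        # Assumed matching condition: the (accumulated) score reaches the threshold.
        if parent in total and total[path] >= threshold:
            total[parent] += total[path]
    return total
```

Starting from scores of 5 for both "A/B/D" and "A/B" with a threshold of 5, node "A/B" ends up with 10, matching the example in the text.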

The determination unit 207 sorts the node list generated by the generation unit 202 in descending order of degree of match with the search condition, based on the scores added by the addition unit 206 and the scores calculated by the calculation unit 204, and determines a predetermined number of nodes from the top of the sorted node list as the search results.

The number of nodes that the determination unit 207 determines as search results need not be set in advance; it may, for example, be designated by the user. Specifically, the function of the determination unit 207 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.
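The sort-and-truncate step can be sketched as below, assuming the final scores are held in a dictionary keyed by node path. The name `top_nodes` and the `limit` parameter are hypothetical.

```python
def top_nodes(scores: dict[str, float], limit: int) -> list[tuple[str, float]]:
    """Return the `limit` highest-scoring nodes, best match first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Because the list is already in descending order of score, it can be handed to the output stage without further sorting.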

The output control unit 208 controls output so that the nodes determined as search results by the determination unit 207 are displayed on the display unit 209 in descending order of degree of match with the search condition. The output control unit 208 is not limited to displaying the nodes determined as search results by the determination unit 207; it may, for example, output them to a file or transmit them to another apparatus connected to the document search apparatus 100. Specifically, the function of the output control unit 208 is realized by the CPU 101 executing a program stored in, for example, the ROM 102, RAM 103, HD 105, or FD 107 shown in FIG. 1.

Under the control of the output control unit 208, the display unit 209 displays the nodes determined as search results by the determination unit 207 in descending order of degree of match with the search condition. Specifically, the function of the display unit 209 is realized by, for example, the display 110 shown in FIG. 1.

(Example of XML document)
Next, an example of an XML document used in the document search apparatus 100 according to the embodiment of the present invention is described. FIG. 3 is an explanatory diagram showing an example of an XML document used in the document search apparatus 100 according to the embodiment of the present invention.

FIG. 3 shows the XML document "c:\documents\0123.xml" modeled as a tree structure. In FIG. 3, nodes 1 to 11 are element nodes, and the numbers "1" to "11" are their indexes. Nodes A to E are text nodes, and the letters "A" to "E" are their indexes.

In FIG. 3, for example, text node A belongs to element node 4, and text node A has the text "XML, scheme". Written with tags, text node A can be expressed as "<article><body><sec><p1>XML, scheme</p1></sec></body></article>".
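As a small illustration, the tagged fragment above can be parsed with Python's standard `xml.etree.ElementTree` module to recover the text of text node A. Treating the `p1` element as element node 4 is an inference from the tag sequence, since FIG. 3 itself is not reproduced here.

```python
import xml.etree.ElementTree as ET

doc = "<article><body><sec><p1>XML, scheme</p1></sec></body></article>"
root = ET.fromstring(doc)          # root element: <article>
p1 = root.find("body/sec/p1")      # presumably element node 4 in FIG. 3
print(p1.text)                     # text held by the text node under <p1>
```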

（生成部２０２によるノードリストの生成手順）
つぎに、生成部２０２によるノードリストの生成手順について説明する。図４は、生成部２０２によるノードリストの生成手順の一例を示すフローチャートである。 (Node list generation procedure by the generation unit 202)
Next, a node list generation procedure by the generation unit 202 will be described. FIG. 4 is a flowchart illustrating an example of a node list generation procedure by the generation unit 202.

まず、木構造にモデル化されたＸＭＬ文書の中から、要素ノードを一つ選択する（ステップＳ４０１）。最初は、最上位の要素ノードを選択する。たとえば、図３に示したＸＭＬ文書の場合、要素ノード１が選択される。 First, one element node is selected from an XML document modeled in a tree structure (step S401). First, the highest element node is selected. For example, in the case of the XML document shown in FIG. 3, element node 1 is selected.

つぎに、ステップＳ４０１で選択された要素ノードをノードリストに追加する（ステップＳ４０２）。ここで、ノードリストに追加される情報は、要素ノードのインデックスやパスなどである。たとえば、図３に示したＸＭＬ文書における要素ノード１の場合は、インデックス「１」やパス「／ａｒｔｉｃｌｅ」などである。 Next, the element node selected in step S401 is added to the node list (step S402). Here, the information added to the node list includes the index and path of the element node. For example, in the case of the element node 1 in the XML document shown in FIG. 3, the index is “1”, the path is “/ article”, and the like.

Next, it is determined whether a text node belongs to the element node selected in step S401 (step S403). In the XML document shown in FIG. 3, for example, no text node belongs to element node 1, whereas a text node belongs to element node 4.

If it is determined in step S403 that a text node belongs to the element node (step S403: Yes), the element node selected in step S401 is associated with the text node belonging to it (step S404), and the process proceeds to step S405.

The information associated with the element node includes the index and text of the text node. For text node A in the XML document shown in FIG. 3, for example, this is the index “A” and the text “XML, scheme”. If, on the other hand, it is determined in step S403 that no text node belongs to the element node (step S403: No), step S404 is skipped and the process proceeds to step S405.

Next, it is determined whether all element nodes included in the XML document have been selected (step S405). If all element nodes have been selected (step S405: Yes), the series of processing ends. If not all element nodes have been selected (step S405: No), the next element node in the XML document is selected relative to the element node selected in step S401 (step S406). When both a lower-level node and a same-level node exist, the lower-level node is selected first. In the XML document shown in FIG. 3, for example, element nodes are therefore selected in the order of their index numbers.

The process then returns to step S402, and steps S402 to S406 are repeated until it is determined in step S405 that all element nodes have been selected. In this way, every element node included in the XML document is added to the node list, and every text node included in the XML document is associated with one of the element nodes in the node list.
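The traversal in steps S401 to S406 amounts to a depth-first, pre-order walk of the element tree. The following is a minimal sketch of that idea, not the implementation described in the patent. The function name `build_node_list`, the dictionary keys, the use of Python's `xml.etree.ElementTree`, and the sample document are all assumptions made for illustration; in particular, the text of `p2` is invented, and only the text that directly precedes an element's first child is treated as its text node.

```python
import string
import xml.etree.ElementTree as ET


def build_node_list(xml_text):
    """Return one row per element node, in document (pre-order) order."""
    root = ET.fromstring(xml_text)
    rows = []
    # Labels "A", "B", ... for text nodes (sketch only: at most 26 text nodes).
    text_labels = iter(string.ascii_uppercase)

    def visit(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        row = {"index1": len(rows) + 1, "path": path, "index2": None, "text": None}
        rows.append(row)                       # S402: add the element node
        text = (element.text or "").strip()
        if text:                               # S403: does a text node belong to it?
            row["index2"] = next(text_labels)  # S404: associate the text node
            row["text"] = text
        for child in element:                  # S406: children before siblings
            visit(child, path)

    visit(root, "")
    return rows


# Hypothetical, reduced stand-in for the document of FIG. 3.
SAMPLE = (
    "<article><body><sec>"
    "<p1>XML, scheme</p1><p2>tag, tag</p2>"
    "</sec></body></article>"
)
node_list = build_node_list(SAMPLE)
```

With this sample, the fourth row has the path “/article/body/sec/p1” and the text “XML, scheme”, which matches the row for index “4” described for FIG. 5.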

(Example of a node list generated by the generation unit 202)
Next, an example of a node list generated by the generation unit 202 will be described. FIG. 5 is an explanatory diagram illustrating such an example.

The node list shown in FIG. 5 was generated from the XML document shown in FIG. 3 by the procedure described above with reference to FIG. 4. It consists of the columns “index1”, “pass”, “index2”, and “text”.

The column “index1” holds the index of each element node, and the column “pass” holds its path. The column “index2” holds the index of the text node associated with the element node, and the column “text” holds the text of that text node.

For example, the node list shown in FIG. 5 indicates that the element node with index “4” and path “/article/body/sec/p1” is associated with the text node that has index “A” and contains the text “XML, scheme”.

(Procedure for document search processing by the document search apparatus 100)
Next, the procedure for document search processing by the document search apparatus 100 according to the embodiment of the present invention will be described. FIG. 6 is a flowchart showing an example of this procedure.

First, the acquisition unit 201 acquires an XML document (step S601), and the generation unit 202 generates a node list from the XML document acquired in step S601 (step S602). The specific procedure for generating the node list is as described above with reference to FIG. 4.

Next, the input unit 203 accepts input of a search condition (step S603). For each node in the node list generated in step S602, the calculation unit 204 then calculates a score indicating the degree of match with the search condition input in step S603 (step S604).

Subsequently, the determination unit 205 selects one node from the node list generated in step S602 (step S605) and determines, based on the score calculated in step S604, whether the selected node satisfies a predetermined matching condition (step S606).

If the predetermined matching condition is determined to be satisfied in step S606 (step S606: Yes), the addition unit 206 adds the score of the node selected in step S605 to the score of the parent node to which that node belongs (step S607), and the process proceeds to step S608. If the condition is determined not to be satisfied (step S606: No), step S607 is skipped and the process proceeds to step S608.

Subsequently, the determination unit 205 determines whether all nodes in the node list generated in step S602 have been selected (step S608). If not all nodes have been selected (step S608: No), steps S605 to S608 are repeated until it is determined in step S608 that all nodes have been selected.

If, on the other hand, it is determined in step S608 that all nodes have been selected (step S608: Yes), the determination unit 207 sorts the node list generated in step S602 in descending order of the degree of match with the search condition, based on the scores added in step S607 and the scores calculated in step S604 (step S609), and determines a predetermined number of nodes from the top of the sorted node list as the search results (step S610).

Then, under the control of the output control unit 208, the nodes determined as the search results in step S610 are displayed on the display unit 209 in descending order of the degree of match with the search condition (step S611), and the series of processing ends.
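The loop in steps S605 to S610 can be sketched as a single function, shown here only to illustrate the control flow. The names `rank_nodes`, `score1`, `parent_of`, and `meets_condition` are hypothetical, the sample data is invented, and combining the calculated and added scores by simple summation is an assumption; the text states only that both scores are used for sorting.

```python
def rank_nodes(score1, parent_of, meets_condition, limit):
    """Return up to `limit` node indexes, best match first."""
    added = {node: 0 for node in score1}
    for node in score1:                                   # S605, S608: visit every node once
        parent = parent_of.get(node)
        if parent is not None and meets_condition(node):  # S606: matching condition
            added[parent] += score1[node]                 # S607: add to the parent's score
    # Assumption: the ranking key is the calculated score plus the added score.
    total = {node: score1[node] + added[node] for node in score1}
    ranked = sorted(total, key=total.get, reverse=True)   # S609: best match first
    return ranked[:limit]                                 # S610: top `limit` nodes


# Hypothetical data: node 1 is the parent of nodes 2 and 3.
score1 = {1: 0, 2: 5, 3: 7}
parent_of = {2: 1, 3: 1}
top = rank_nodes(score1, parent_of, lambda node: score1[node] > 0, limit=2)
```

Here node 1 receives 5 + 7 = 12 from its children and ranks first, so `top` is `[1, 3]`. Passing the matching condition as a callable keeps the sketch independent of any particular threshold.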

(Example of scores calculated by the calculation unit 204)
Next, an example of the scores calculated by the calculation unit 204 will be described. FIG. 7 is an explanatory diagram illustrating such an example.

FIG. 7 shows the node list of FIG. 5 together with the score of each element node calculated by the calculation unit 204. In FIG. 7, the column “score1” holds the scores calculated by the calculation unit 204. The search character string used in this calculation is “XML, tag, scheme”.

For example, FIG. 7 indicates that the element node with index “4” and path “/article/body/sec/p1” is associated with the score “38” calculated by the calculation unit 204. This score “38” is calculated by the following TF-IDF formula (2).

38 (TF-IDF score) = 1 (TF: number of occurrences of the search character string “XML” in the text node) × 20 (IDF: log(total number of text nodes / number of text nodes containing the search character string “XML”)) + 1 (TF: number of occurrences of the search character string “scheme” in the text node) × 18 (IDF: log(total number of text nodes / number of text nodes containing the search character string “scheme”)) ... (2)

Similarly, the element node with index “5” and path “/article/body/sec/p2” is associated with the score “80” calculated by the calculation unit 204. This score “80” is calculated by the following TF-IDF formula (3).

80 (TF-IDF score) = 2 (TF: number of occurrences of the search character string “tag” in the text node) × 40 (IDF: log(total number of text nodes / number of text nodes containing the search character string “tag”)) ... (3)
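Formulas (2) and (3) both sum, over the search character strings, the product of TF and IDF. The sketch below reproduces that arithmetic. The function names `idf` and `tfidf_score` are hypothetical, and the IDF values 20, 18, and 40 are taken directly from the formulas because the text does not give the underlying node counts.

```python
import math


def idf(total_text_nodes, nodes_containing_term):
    """IDF as defined in formulas (2) and (3): log(total / containing)."""
    return math.log(total_text_nodes / nodes_containing_term)


def tfidf_score(tf, idf_by_term):
    """Sum TF x IDF over the search character strings found in one text node."""
    return sum(count * idf_by_term[term] for term, count in tf.items())


# IDF values as given in formulas (2) and (3); the node counts behind them
# are not stated in the text, so they are used here as plain constants.
IDF = {"XML": 20, "scheme": 18, "tag": 40}

score_node4 = tfidf_score({"XML": 1, "scheme": 1}, IDF)  # 1 x 20 + 1 x 18
score_node5 = tfidf_score({"tag": 2}, IDF)               # 2 x 40
```

A search character string that does not occur in a text node has TF 0 and contributes nothing, which is why “tag” does not appear in formula (2).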

Note that the scores calculated by the calculation unit 204 may be held in the node list itself or in a separate table.

(Example of scores added by the addition unit 206)
Next, an example of the scores added by the addition unit 206 will be described. FIG. 8 is an explanatory diagram illustrating such an example.

FIG. 8 shows the node list of FIG. 5 together with the score of each element node calculated by the calculation unit 204 and the scores added by the addition unit 206. In FIG. 8, the column “score2” holds the scores added by the addition unit 206.

For example, FIG. 8 indicates that the element node with index “3” and path “/article/body/sec” is associated with the score “118” added by the addition unit 206. This score “118” is the sum of the score “38” of the element node with index “4” and the score “80” of the element node with index “5”, both of which belong to this element node.

Prior to this addition, the determination unit 205 determines whether the scores of indexes “4” and “5” are to be added to index “3”. The conditions used in this determination are “add if the total score is 50 or more” and “add if the number of connected nodes is 100 or less”. Because indexes “4” and “5” satisfy these conditions, both of their scores are determined to be added to the score of index “3”.

As shown in FIG. 8, the addition unit 206 sets the scores of indexes “4” and “5” to “0”. This prevents index “3” and indexes “4” and “5” from being determined as search results redundantly.

Similarly, FIG. 8 indicates that the element node with index “9” and path “/article/body/sec/title/name” is associated with the score “66” added by the addition unit 206. This score “66” is the sum of the score “22” of the element node with index “10” and the score “44” of the element node with index “11”, both of which belong to this element node. The addition unit 206 sets the scores of indexes “10” and “11” to “0”.
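The scores of FIG. 8 described above can be reproduced with a short sketch. The function name `merge_children` and its parameters are hypothetical. Two points are interpretations rather than statements from the text: the “total score” condition is applied to the sum of the child scores, and the “number of connected nodes” is taken to be the number of children merged into one parent. The parents' own calculated scores are assumed to be 0, since the text describes 118 and 66 purely as sums of child scores.

```python
def merge_children(score1, children_of, min_total=50, max_children=100):
    """Return score2: child scores moved to the parent when the conditions hold."""
    score2 = dict(score1)
    for parent, children in children_of.items():
        total = sum(score1[child] for child in children)
        # Assumed reading of the two conditions quoted in the text.
        if total >= min_total and len(children) <= max_children:
            score2[parent] = score1[parent] + total
            for child in children:
                score2[child] = 0  # avoid returning parent and children together
    return score2


# Calculated scores and parent/child relations stated for FIG. 7 and FIG. 8;
# the parent scores of 0 are an assumption.
score1 = {3: 0, 4: 38, 5: 80, 9: 0, 10: 22, 11: 44}
children_of = {3: [4, 5], 9: [10, 11]}
score2 = merge_children(score1, children_of)
```

With these inputs, index “3” receives 38 + 80 = 118 and index “9” receives 22 + 44 = 66, while the four child scores become 0. If the summed child score were below 50, the scores would be left unchanged.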

(Example of a node list sorted by the determination unit 207)
Next, an example of a node list sorted by the determination unit 207 will be described. FIG. 9 is an explanatory diagram illustrating such an example.

FIG. 9 shows the node list of FIG. 5 after the determination unit 207 has sorted it in descending order of the degree of match with the search condition, based on the scores added by the addition unit 206 as shown in FIG. 8 and the scores calculated by the calculation unit 204 as shown in FIG. 7. From the node list sorted in this way, the determination unit 207 determines a predetermined number of nodes from the top as the search results.

For example, when the number of search results is specified as three, the determination unit 207 determines three element nodes from the node list shown in FIG. 9 as the search results: the element nodes with indexes “3”, “8”, and “9”.

(Example of search results displayed on the display unit 209)
Next, an example of the search results displayed on the display unit 209 will be described. FIG. 10 is an explanatory diagram illustrating such an example.

FIG. 10 shows the search results displayed on the display unit 209 after the document search process described above with reference to FIG. 6 was performed on the XML document shown in FIG. 3. As shown in FIG. 10, the user specified the search target document “c:¥documents¥0123.xml”, the search condition “XML, tag, scheme”, and the number of search results “3”.

When the “Search” button is pressed, the document search process is performed on the search target document “c:¥documents¥0123.xml”, and the top three nodes determined from that document are displayed as the search results.

As described above, the document search apparatus 100 according to the present embodiment acquires an XML document, generates a node list from the acquired XML document, and accepts input of a search condition. For each node in the generated node list, it calculates a score indicating the degree of match with the input search condition and determines whether the node satisfies a predetermined matching condition. The score of each node determined to satisfy the condition is added to the score of the parent node to which that node belongs. Based on the added scores and the calculated scores, the generated node list is sorted in descending order of the degree of match with the search condition, a predetermined number of nodes from the top of the sorted list are determined as the search results, and the determined nodes are displayed in descending order of the degree of match.

As a result, when a node having a text node satisfies the predetermined matching condition in the document search process, the parent node to which that node belongs can be treated as a node with a high degree of match with the search condition. The required number of nodes with the highest degree of match can then be retrieved from the XML document, and the retrieved nodes can be displayed in an appropriate order. Partial documents can thus be retrieved in appropriate units and numbers and displayed in an appropriate order, which improves both search accuracy and usability in the document search process.

Note that the document search apparatus, document search method, and document search program according to the present invention can also be applied to documents other than XML documents, as long as the document set is hierarchically structured. They are not limited to documents stored as files and can also be applied, for example, to documents stored in a database. Furthermore, they are not limited to documents stored in a single file or a single database and can also be applied to documents stored across multiple files or multiple databases.

The document search method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD, and is executed by being read from the recording medium by the computer. The program may also take the form of a transmission medium that can be distributed via a network such as the Internet.

As described above, the document search apparatus, document search method, and document search program according to the present invention are suitable for personal computers, document servers, document search software, and the like that search a hierarchically structured document set for nodes matching a search condition input as a natural sentence.

FIG. 1 is a block diagram showing an example of the hardware configuration of the document search apparatus according to this embodiment.
FIG. 2 is a block diagram showing the functional configuration of the document search apparatus according to this embodiment.
FIG. 3 is an explanatory diagram showing an example of an XML document used by the document search apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart showing an example of the node list generation procedure by the generation unit.
FIG. 5 is an explanatory diagram showing an example of a node list generated by the generation unit.
FIG. 6 is a flowchart showing an example of the procedure for document search processing by the document search apparatus according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of scores calculated by the calculation unit.
FIG. 8 is an explanatory diagram showing an example of scores added by the addition unit.
FIG. 9 is an explanatory diagram showing an example of a node list sorted by the determination unit.
FIG. 10 is an explanatory diagram showing an example of search results displayed on the display unit.

Explanation of symbols

100 Document search apparatus
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 CD-RW drive
109 CD-RW
110 Display
111 Keyboard
112 Mouse
113 Network I/F
114 Communication cable
120 Bus
201 Acquisition unit
202 Generation unit
203 Input unit
204 Calculation unit
205 Determination unit
206 Addition unit
207 Determination unit
208 Output control unit
209 Display unit

Claims

1. A document search apparatus that searches a hierarchically structured document set for a node that matches a search condition input as a natural sentence, the apparatus comprising:
acquisition means for acquiring the document set;
generation means for generating a node list from the document set acquired by the acquisition means;
input means for receiving input of the search condition;
calculation means for calculating, for each node in the node list generated by the generation means, a score indicating the degree of match with the search condition input by the input means;
determination means for determining, for each node in the node list generated by the generation means, whether a predetermined matching condition is satisfied, based on the score calculated by the calculation means;
addition means for adding the score of a node determined by the determination means to satisfy the predetermined matching condition to the score of the parent node to which the node belongs; and
decision means for deciding, as a search result, a node having a high degree of match with the search condition from the node list generated by the generation means, based on the score added by the addition means and the score calculated by the calculation means.

2. The document search apparatus according to claim 1, further comprising output control means for controlling output so that the nodes decided by the decision means are displayed in descending order of the degree of match with the search condition.

3. The document search apparatus according to claim 1, wherein the decision means sorts the node list generated by the generation means in descending order of the degree of match with the search condition, based on the score added by the addition means and the score calculated by the calculation means, and decides a predetermined number of nodes from the top of the sorted node list as search results.

4. The document search apparatus according to claim 1, wherein the calculation means uses the TF-IDF method to calculate, for each node in the node list generated by the generation means, the score indicating the degree of match with the search condition input by the input means.

5. A document search method for searching a hierarchically structured document set for a node that matches a search condition input as a natural sentence, the method causing a computer to execute:
an acquisition step of acquiring the document set;
a generation step of generating a node list from the document set acquired in the acquisition step;
an input step of receiving input of the search condition;
a calculation step of calculating, for each node in the node list generated in the generation step, a score indicating the degree of match with the search condition input in the input step;
a determination step of determining, for each node in the node list generated in the generation step, whether a predetermined matching condition is satisfied, based on the score calculated in the calculation step;
an addition step of adding the score of a node determined in the determination step to satisfy the predetermined matching condition to the score of the parent node to which the node belongs; and
a decision step of deciding, as a search result, a node having a high degree of match with the search condition from the node list generated in the generation step, based on the score added in the addition step and the score calculated in the calculation step.

6. A document search program for searching a hierarchically structured document set for a node that matches a search condition input as a natural sentence, the program causing a computer to execute:
an acquisition step of acquiring the document set;
a generation step of generating a node list from the document set acquired in the acquisition step;
an input step of receiving input of the search condition;
a calculation step of calculating, for each node in the node list generated in the generation step, a score indicating the degree of match with the search condition input in the input step;
a determination step of determining, for each node in the node list generated in the generation step, whether a predetermined matching condition is satisfied, based on the score calculated in the calculation step;
an addition step of adding the score of a node determined in the determination step to satisfy the predetermined matching condition to the score of the parent node to which the node belongs; and
a decision step of deciding, as a search result, a node having a high degree of match with the search condition from the node list generated in the generation step, based on the score added in the addition step and the score calculated in the calculation step.