JP2006178599A

JP2006178599A - Document retrieval device and method

Info

Publication number: JP2006178599A
Application number: JP2004369143A
Authority: JP
Inventors: Yasuhiro Ishitobi; 康浩石飛
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2004-12-21
Filing date: 2004-12-21
Publication date: 2006-07-06

Abstract

PROBLEM TO BE SOLVED: To refine searches easily and properly.

SOLUTION: A search condition A is input via a user interface part 10 and sent to a document retrieval part 13 (X01). Either a full-text search performing part 20 or an attribute search performing part 22 performs a search for the search condition A and outputs the search result to a search result acquisition part 14 (X02). The search result acquisition part 14 delivers the search result to an important word extracting part 15, which is requested to refer to a related document search index and obtain a group of characteristic important words (X03). A display synthesizing part 16 receives the group of important words (characteristic words) and creates display data for displaying them together with the search result, and the user interface part 10 displays the data (X04). A search refinement button B is operated to refine the search (X05). Display of the search result and search refinement can be repeated easily by operating the search refinement button B.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a document search technique and, in particular, enables search results to be narrowed down appropriately and easily.

A search may return a large number of results. In that case it is desirable to run a further refinement search to reduce the number of documents that have to be accessed. It is therefore important to create an appropriate refinement condition efficiently from the search results. In the prior art described below, the key point is how the candidate search terms for refinement are determined, but each approach has drawbacks.

In Patent Document 1, noun phrases are extracted from the search target documents and indexed in advance, and the index record for each noun phrase is used to present noun phrases that contain the search term entered by the user as search terms for a refinement search. This method has the drawback that words related to the user's input but written differently are not presented.

In Patent Document 2, search term candidates related to an entered search term are obtained from a history of search conditions entered by users. However, this method has the drawback that no candidates can be obtained unless search condition data has been accumulated as a history.

In Patent Document 3, each search result is morphologically analyzed at search time, the words in the result and their frequency of occurrence within the document are obtained, and words whose frequency is at or below a threshold are used as candidate search terms for refinement. This prior art has two problems. First, because each search result is morphologically analyzed at search time, the processing time until the user obtains the search results is long. Second, because words with a low frequency of occurrence are selected as refinement candidates, appropriate search terms cannot be selected. Specifically, neither the text length of each search result (words from short texts tend to be selected) nor the document frequency within the search result group or the entire set of search target documents is taken into account.

Furthermore, because multiple refinement search terms can be specified, the result document group may be narrowed excessively, which raises the likelihood that the refinement returns zero results.

JP 2002-342373 A
JP 2003-108594 A
JP 2004-54619 A

The present invention has been made in view of the above circumstances, and its object is to provide a document search technique that can narrow down search results appropriately and easily.

In a basic configuration example of the invention, index information for related document search (containing the words of a predetermined word set and their importance) is applied to the document group returned by an attribute search or a document content search in order to obtain the characteristic words of that result document group (related words and their importance; the importance is typically expressed as a score calculated from the index information). The related words are then presented together with the search result document group.

At this time, related words that contain a character string already used in the search condition and related words contained in all of the search results are removed from the related word group, and the remaining top-ranked related words, in order of importance, are used for refinement. A related word contained in every search result document is removed because refining with it would not narrow the results at all.

Specifically, the following two types of index data are created when a document is stored in the document management system. The first is index data from which the words appearing in a document and their frequency information can be obtained from the document identifier; it is used for so-called related document search. The second is index data from which the identifiers of the documents containing a word can be obtained from that word (a keyword or an attribute); it is used for attribute search and content search (full-text search, keyword search).
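The two kinds of index data can be pictured with a minimal Python sketch. The function name `build_indexes`, the dictionary layout, and the toy documents are assumptions made for illustration; the patent does not prescribe a storage format, and the documents are assumed to be tokenized already.

```python
from collections import Counter, defaultdict

def build_indexes(docs):
    """docs: mapping of document id -> list of words (assumed pre-tokenized)."""
    forward = {}                 # first index: doc id -> {word: frequency}
    inverted = defaultdict(set)  # second index: word -> {doc ids}
    for doc_id, words in docs.items():
        counts = Counter(words)
        forward[doc_id] = dict(counts)
        for word in counts:
            inverted[word].add(doc_id)
    return forward, dict(inverted)

# Hypothetical toy collection.
docs = {
    "d1": ["printer", "toner", "printer"],
    "d2": ["printer", "scanner"],
}
forward, inverted = build_indexes(docs)
print(forward["d1"])        # words and frequencies of document d1
print(inverted["printer"])  # identifiers of documents containing "printer"
```

The forward index answers "which words does this document contain, and how often", while the inverted index answers "which documents contain this word"; the approach described here relies on having both available at search time.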

This configuration example uses index data for attribute search based on the attributes assigned to documents and for full-text search based on n-grams. After such an attribute search or full-text search (content search) has been performed, the related document search index is used to obtain the characteristic words of the result document group so that the search results can be narrowed down.

The words (related words) and their importance are calculated by the method described in the applicant's Japanese Patent No. 3427674 or JP 2002-32411 A. In short, the importance is calculated, for example, as a weight based on TF (term frequency) and IDF (inverse document frequency). A weight vector whose elements are the word weights is used as the index data of each document (Takenobu Tokunaga, "Information Retrieval and Language Processing", University of Tokyo Press, 1999).

In this configuration example, when the search results are displayed, multiple refinement search buttons or similar controls, each preset with a refinement search condition, are displayed as well. For example, one button is displayed for each refinement word. Each refinement search condition joins a related word obtained as described above to the original search condition with an AND operator. Operating a refinement search button executes the refinement search and retrieves its results in a single step, and this can be repeated.

What is combined with the related word is not limited to the "search condition"; it may instead be the "set of document identifiers" that satisfies the condition. In other words, any information that limits the target of the refinement is sufficient.

Specifically, this configuration example executes the following procedure.

[Step 1]: Enter a search condition.
[Step 2]: Obtain the search results that match the search condition.
[Step 3]: Obtain the related words in the search results. Related words that contain a search character string from the search condition and related words that appear in all of the search result documents are excluded, and the top X words (X is a positive integer) are obtained.
[Step 4]: Create refinement search conditions by combining, with a logical AND, each related word obtained in Step 3 with either the search condition from Step 1 or the set of document identifiers obtained in Step 2.
[Step 5]: Display the search results together with refinement search buttons preset with the refinement search conditions.

Steps 2 to 5 above can be executed repeatedly.
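Steps 2 to 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-document word weights are given as plain dictionaries, the search is a simple AND over an inverted index derived from them, and all names (`search`, `related_terms`, `refinement_conditions`) and the toy data are hypothetical rather than taken from the patent.

```python
def search(inverted, terms):
    """Step 2: AND search; ids of documents containing every term."""
    sets = [inverted.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def related_terms(weights, hits, query_terms, top_x):
    """Step 3: score words over the hits, drop excluded words, keep the top X."""
    scores = {}
    for doc_id in hits:
        for word, w in weights[doc_id].items():
            scores[word] = scores.get(word, 0.0) + w
    candidates = []
    for word, score in scores.items():
        if any(q in word for q in query_terms):
            continue  # contains a search string already in the condition
        if all(word in weights[d] for d in hits):
            continue  # appears in every hit, so it cannot narrow anything
        candidates.append((score, word))
    candidates.sort(key=lambda p: (-p[0], p[1]))
    return [word for _, word in candidates[:top_x]]

def refinement_conditions(query_terms, terms):
    """Step 4: AND each related word onto the original condition."""
    return [list(query_terms) + [t] for t in terms]

# Hypothetical per-document word weights (e.g. tf*IDF-style scores).
weights = {
    "d1": {"printer": 0.2, "toner": 0.9},
    "d2": {"printer": 0.3, "scanner": 0.8, "toner": 0.1},
    "d3": {"printer": 0.1, "fax": 0.7},
    "d4": {"copier": 1.0},
}
inverted = {}
for doc_id, vec in weights.items():
    for word in vec:
        inverted.setdefault(word, set()).add(doc_id)

query = ["printer"]                                # Step 1
hits = search(inverted, query)                     # Step 2
terms = related_terms(weights, hits, query, 2)     # Step 3
conditions = refinement_conditions(query, terms)   # Step 4
refined = search(inverted, conditions[0])          # one press of a refinement button
print(terms, conditions, refined)
```

Each entry of `conditions` corresponds to what one refinement button in Step 5 would carry; running `search` on it again yields a smaller result set on which Steps 3 to 5 can be repeated.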

The invention is described further below. The reference numerals used in the embodiment described later are attached to make the description easier to follow, but they are not intended to limit the scope of the rights.

According to one aspect of the invention, to achieve the above object, a document search device (100) is provided with: search means (20, 22) for searching a document group; important phrase information storage means (25) for storing important phrase information for each document; refinement word determining means (15) for referring to the important phrase information storage means to acquire the important phrase information of the documents included in a search result of the search means, and for determining refinement words based on the acquired important phrase information; and refinement search instruction means (10, 16) for instructing the search means to execute a refinement search of the search result using the refinement words.

In this configuration, the important phrase information of the documents included in a search result is acquired by referring to the important phrase information storage means, and important phrases can be selected as refinement words.

In this configuration, the search means is, for example, search means that performs a full-text search by referring to index data based on n-grams, search means that performs a full-text search by referring to index data prepared for each word, or search means that performs an attribute search by referring to index data prepared on the basis of document attributes.

The important phrase information is a set of weights, prepared for each document, of the words contained in a predetermined word set. In this case, the weight of each word for each document is preferably calculated from the frequency of the word in that document and the number of documents in which the word appears.

The per-document important phrase information stored in the important phrase information storage means is typically the information of a related document search index used by related document search means that searches for documents related to a given word.

Preferably, the refinement word determining means excludes a candidate refinement word when it is contained in a predetermined proportion or more of the search result documents. The predetermined proportion may be chosen according to the desired degree of refinement. The proportion of documents containing a word can be obtained from the frequency information of each word in each document: using the tf values that are used to calculate tf*IDF, the refinement proportion can be calculated as the proportion of documents in which the word has tf > 0.

Typically, candidates contained in all of the search result documents may be excluded from the refinement words. This case can be detected from tf, but it can also be detected from the fact that IDF = log(N/df) becomes zero (and so does tf*IDF), where N is the total number of target documents and df is the number of documents containing the word.
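The two checks can be illustrated with a short Python sketch. It assumes that N and df are counted within the search result set (so that a word present in every result gives log(N/N) = 0), that the result set is non-empty, and that per-document tf values are available as dictionaries; the function names and data are hypothetical.

```python
import math

def coverage(tf_by_doc, hits, word):
    """Proportion of result documents in which the word has tf > 0."""
    containing = sum(1 for d in hits if tf_by_doc[d].get(word, 0) > 0)
    return containing / len(hits)  # assumes hits is non-empty

def idf_in_results(tf_by_doc, hits, word):
    """IDF = log(N / df), with N and df counted over the result set."""
    n = len(hits)
    df = sum(1 for d in hits if tf_by_doc[d].get(word, 0) > 0)
    if df == 0:
        raise ValueError("word does not occur in the result set")
    return math.log(n / df)

def usable_for_refinement(tf_by_doc, hits, word, max_ratio=1.0):
    """Exclude a candidate found in max_ratio or more of the results."""
    return coverage(tf_by_doc, hits, word) < max_ratio

# Hypothetical term frequencies for three result documents.
tf_by_doc = {
    "d1": {"printer": 2, "toner": 1},
    "d2": {"printer": 1, "scanner": 3},
    "d3": {"printer": 1, "toner": 2},
}
hits = ["d1", "d2", "d3"]

print(coverage(tf_by_doc, hits, "printer"))        # in every result
print(idf_in_results(tf_by_doc, hits, "printer"))  # zero for such a word
print(usable_for_refinement(tf_by_doc, hits, "toner"))
```

With the default `max_ratio=1.0` only words found in every result are excluded; a lower value such as 0.5 would also drop words that appear in most results and therefore narrow little.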

The important phrase information for each document can be the importance, in that document, of each word in a predetermined word set (for example, a word vector). In this case, the elements of the word vectors are summed over the search result documents to obtain a score for each element, which serves as an index of importance across the entire search result. However, a large score suggests that the word probably appears in most of the documents in the search result, so the top X words among those whose score is below a predetermined threshold may be selected instead.

For each refinement word, the proportion of the search result documents that contain that word may also be displayed. This lets the user see in advance what each word would narrow the results to and choose a refinement word with that as a guide. As described above, the proportion of documents containing the refinement word can be calculated, for example, by referring to the tf values.

According to another aspect of the invention, a document search device (100) is provided with: search means (20, 22) for searching a document group; important phrase information storage means (25) for storing important phrase information for each document; means (15) for referring to the important phrase information storage means to determine the important phrases of the documents included in a search result of the search means; search result and important phrase display means (10, 16) for displaying the search result and the important phrases; and refinement search instruction means (10) for instructing, in response to a user operation on a displayed important phrase, that a refinement search of the search result be executed using the operated important phrase.

The invention can be realized not only as a device or a system but also as a method. Part of such an invention can of course be implemented as software, and a software product used to cause a computer to execute such software naturally also falls within the technical scope of the invention.

The above and other aspects of the invention are set forth in the claims and are described in detail below using an embodiment.

According to the invention, the search results of an attribute search or a content search can be narrowed down easily and appropriately.

An embodiment of the invention is described below.

FIG. 1 shows, as a whole, an embodiment in which the invention is applied to a document management system. In this example, the document management system 100 is realized by installing a computer program on a computer 101 from a recording medium 102. Each block in the figure represents a function realized by the hardware and software resources of the computer 101 working together. This example is implemented on a single computer system, but a server computer and client devices connected over a network may be used instead.

In FIG. 1, the document management system 100 includes a user interface unit 10, a document management unit 11, a registered document group storage unit 12, a document search unit 13, a search result acquisition unit 14, an important word extraction unit 15, a display synthesis unit 16, an index generation unit 17, and so on. In this example, the document search unit 13 includes a full-text search execution unit 20, a full-text search index storage unit 21, an attribute search execution unit 22, an attribute search index storage unit 23, a related document search execution unit 24, a related document search index storage unit 25, and so on.

The user interface unit 10 accepts input from the user and displays output from the document management system 100. As far as search is concerned, the user enters a search condition using an input form such as the one shown in FIG. 4. The condition may be typed on the keyboard, and pull-down menus can be used for items such as the attributes of an attribute search. A search condition can be specified with keywords, attributes, natural language text, and so on. Natural language text is morphologically analyzed to extract keywords, which can be used for the search either as keywords or in the form of a word vector described later. The user interface unit 10 can be implemented, for example, as a web-based interface, but is not limited to this.

The document management unit 11 holds and manages the documents registered by users in the registered document group storage unit 12. Each document is assigned management attributes such as those shown in FIG. 8.

The document search unit 13 performs full-text search, attribute search, and related document search. These searches may also be combined using logical expressions.

The full-text search execution unit 20 of the document search unit 13 performs a full-text search using the full-text search index held in the full-text search index storage unit 21. As shown in FIG. 5, for example, the full-text search index consists of index records that store, for each n-gram, the identifiers of the documents containing it and its positions of occurrence (there may be more than one). The index records are generated so as to cover the search target documents, typically the documents stored in the registered document group storage unit 12.

Based on the n-grams that make up the keyword entered as the search condition and their positional relationship, index records such as those shown in FIG. 5 are consulted, and the identifiers of the matching documents are listed as the search result. Instead of the per-n-gram index records shown in FIG. 5, index records for each morpheme (word, keyword) obtained by morphological analysis of the documents may be used. FIG. 9 shows an example.
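A positional n-gram lookup of this kind might look like the following Python sketch. It uses character bigrams (n = 2) and an in-memory nested dictionary; the record layout of FIG. 5 is not reproduced here, and the function names and sample texts are assumptions for illustration.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """n-gram -> doc id -> list of start positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]][doc_id].append(pos)
    return index

def search_keyword(index, keyword, n=2):
    """Ids of documents in which the keyword's n-grams occur consecutively."""
    if len(keyword) < n:
        raise ValueError("keyword is shorter than n")
    grams = [keyword[i:i + n] for i in range(len(keyword) - n + 1)]
    hits = set()
    for doc_id, positions in index.get(grams[0], {}).items():
        for start in positions:
            # The i-th n-gram must start exactly i characters after the first.
            if all(start + i in index.get(g, {}).get(doc_id, [])
                   for i, g in enumerate(grams)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical documents.
docs = {
    "d1": "document retrieval device",
    "d2": "retrieval of a document",
}
index = build_ngram_index(docs)
print(search_keyword(index, "document retrieval"))
print(search_keyword(index, "retrieval"))
```

Checking the offsets is what distinguishes a true phrase match from a document that merely contains all of the n-grams somewhere; `d2` contains both words but not in the queried order.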

The attribute search execution unit 22 performs an attribute search using the attribute search index held in the attribute search index storage unit 23. As shown in FIG. 6, for example, the attribute search index consists of index records that store the matching document identifiers for each attribute value (or attribute value range) of each attribute. The index records are generated so as to cover the search target documents, typically the documents stored in the registered document group storage unit 12. The attributes are the attribute information (FIG. 8) managed by the document management unit 11. The attribute search execution unit 22 consults the attribute search index, lists the identifiers of the documents whose attributes match the attribute values in the search condition, and returns them as the search result.

The related document search execution unit 24 performs a related document search using the related document search index held in the related document search index storage unit 25. In this example, the related document search index holds index records such as those shown in FIG. 7. Each index record assigns important phrase information, for example a word vector, to a search target document, typically to each document (document identifier) stored in the registered document group storage unit 12. A word vector consists of the weights w1, w2, ..., wn, in that document, of the words in a predetermined word set. The weights are typically generated based on tf*IDF, and each word vector is normalized. Here tf is the frequency with which the word occurs in the document, and IDF is, as one example, log(N/df), although it is not limited to this: log(N/df) + 1 or similar may be used, and various logarithm bases can be adopted. N is the total number of target documents and df is the number of documents in which the word appears.
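As a rough illustration, normalized tf*IDF word vectors can be computed as in the Python sketch below. It uses the plain IDF = log(N/df) variant with the natural logarithm and L2 normalization; both are choices made for this example, since the text leaves the exact IDF variant and normalization open, and the names and data are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: doc id -> list of words. Returns doc id -> {word: weight}."""
    n = len(docs)
    counts = {d: Counter(words) for d, words in docs.items()}
    df = Counter()
    for c in counts.values():
        df.update(c.keys())  # each document counts once per word
    vectors = {}
    for d, c in counts.items():
        raw = {w: tf * math.log(n / df[w]) for w, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # L2-normalize; an all-zero vector is left as zeros.
        vectors[d] = {w: (v / norm if norm else 0.0) for w, v in raw.items()}
    return vectors

# Hypothetical tokenized documents.
docs = {
    "d1": ["printer", "toner", "toner"],
    "d2": ["printer", "scanner"],
    "d3": ["printer", "fax", "toner"],
}
vectors = tfidf_vectors(docs)
for doc_id, vec in vectors.items():
    print(doc_id, vec)
```

In this toy collection "printer" occurs in every document, so its weight is zero everywhere, which is the behaviour that allows ubiquitous words to be recognized and dropped as refinement candidates.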

The related document search execution unit 24 computes, for example, the inner product of the word vector for the one or more entered keywords and the word vector of each document, and lists a predetermined number of documents, or the documents whose value exceeds a predetermined threshold, as the search result. A related document search can also be performed using keywords (a word vector) obtained from natural language text or the word vector of a seed document.
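The inner-product ranking can be sketched in Python as follows, assuming sparse word vectors stored as dictionaries. The function name, the threshold and limit parameters, and the sample vectors are illustrative assumptions, not details taken from the patent.

```python
def related_documents(query_vec, doc_vecs, threshold=0.0, limit=None):
    """Rank documents by inner product with the query word vector."""
    scored = []
    for doc_id, vec in doc_vecs.items():
        score = sum(w * vec.get(word, 0.0) for word, w in query_vec.items())
        if score > threshold:
            scored.append((score, doc_id))
    scored.sort(key=lambda p: (-p[0], p[1]))  # best score first
    ids = [doc_id for _, doc_id in scored]
    return ids if limit is None else ids[:limit]

# Hypothetical normalized word vectors.
doc_vecs = {
    "d1": {"toner": 0.8, "printer": 0.6},
    "d2": {"scanner": 1.0},
    "d3": {"toner": 0.6, "fax": 0.8},
}
query_vec = {"toner": 1.0}
print(related_documents(query_vec, doc_vecs))
print(related_documents(query_vec, doc_vecs, limit=1))
```

Passing a document's own vector as `query_vec` gives the seed-document variant mentioned above.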

The search result acquisition unit 14 acquires a set of document identifiers from the document search unit 13 as the search result. This set of document identifiers is typically passed to the display synthesis unit 16, which generates display data for the search result list and supplies it to the user interface unit 10. The user interface unit 10 displays the search result list based on the display data.

In addition, when the search is not a related document search, that is, when it is a full-text search, an attribute search, or a combination of the two, the search result acquisition unit 14 also supplies the set of document identifiers in the search result to the important word extraction unit 15. The important word extraction unit 15 refers to the related document search index storage unit 25 to acquire the important phrase information (word vectors) of the search result (the set of document identifiers) and, based on this, extracts words that are expected to be important for the search result. For example, the word vectors of the search result documents are accumulated element by element (word by word), and words are selected according to the accumulated values. That is, when the search result consists of m documents, each element wi of the word vector is accumulated over the m documents, and the elements i (words) that exceed a predetermined threshold, or a predetermined number of top-ranked elements, are output to the display synthesis unit 16 as important words.
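The accumulation over m result documents can be sketched as below, assuming dense word vectors stored as lists over a fixed vocabulary so that element i always refers to the same word. The vocabulary, the vectors, and the function names are made up for illustration.

```python
def accumulate(vocabulary, vectors, hits):
    """Sum each element w_i of the word vectors over the result documents."""
    totals = [0.0] * len(vocabulary)
    for doc_id in hits:
        for i, w in enumerate(vectors[doc_id]):
            totals[i] += w
    return totals

def pick_important(vocabulary, totals, top_n=None, threshold=None):
    """Keep words above a threshold and/or the top-ranked words."""
    ranked = sorted(zip(totals, vocabulary), key=lambda p: (-p[0], p[1]))
    if threshold is not None:
        ranked = [p for p in ranked if p[0] > threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [word for _, word in ranked]

# Hypothetical fixed word set and per-document weight vectors.
vocabulary = ["fax", "printer", "scanner", "toner"]
vectors = {
    "d1": [0.0, 0.2, 0.0, 0.9],
    "d2": [0.0, 0.3, 0.8, 0.1],
    "d3": [0.7, 0.1, 0.0, 0.0],
}
hits = ["d1", "d2"]  # identifiers returned by a full-text or attribute search

totals = accumulate(vocabulary, vectors, hits)
print(pick_important(vocabulary, totals, top_n=2))
print(pick_important(vocabulary, totals, threshold=0.6))
```

Because only stored vectors are read and added, no morphological analysis of the result documents is needed at search time.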

As shown in FIG. 3, the display synthesis unit 16 generates display data that shows, together with the search result list A, refinement search buttons B that display the important words (also called characteristic words) determined by the important word extraction unit 15. Based on this data, the user interface unit 10 presents the user with a display containing the search results and the refinement buttons. FIG. 3 shows the labels "characteristic word 1", "characteristic word 2", and so on, but in practice the specific words are displayed. When the user operates one of the buttons B, for example by clicking it, the user interface unit 10 supplies the corresponding refinement search condition to the document search unit 13.

If the original search condition is search condition A, the refinement search condition is "important word AND search condition A". Alternatively, the logical product with the set of document identifiers in the search result may be taken.

Multiple important words may also be selected and combined in a logical expression, such as a logical product or a logical sum, to specify the refinement search. For example, clicking a refinement search button selects the corresponding important word and copies it into the input form for condition entry; combined with operator input, this allows the original search condition A (or its set of document identifiers) to be refined with a logical expression containing the important words (characteristic words).

The index generation unit 17 generates and updates the full-text search index, the attribute search index, and the related document search index for the document group held in the registered document group storage unit 12. It is preferable to generate indexes that reflect newly registered and modified documents at appropriate times. When weights calculated from tf and IDF, for example, are used for the related document search index, tf and IDF are retained so that information on new documents can be reflected in them. These tf and IDF values can also be used to calculate auxiliary information such as the refinement ratio.

Next, the operation of the embodiment is described with reference to the example shown in FIG. 2. Only the case where a full-text search or an attribute search is performed with search condition A is described here.

First, search condition A is entered through the user interface unit 10, using an input form such as the one shown in FIG. 4, and is sent to the document search unit 13 (X01). The full-text search execution unit 20 or the attribute search execution unit 22 performs a search for search condition A and outputs the search result to the search result acquisition unit 14 (X02). The search result acquisition unit 14 passes the search result to the important word extraction unit 15 and requests it to obtain a group of characteristic important words (feature words) by referring to the related document search index (X03). The important word group consists of, for example, the top N words that remain after excluding words containing the search string in search condition A (related words, that is, words forming elements of the word vector) and words contained in all of the search result documents. Whether a word is contained in all of the search result documents can be determined by referring to the tf of the search result documents or by checking whether the IDF (tf*IDF) for the search result documents becomes zero. The display synthesis unit 16 receives the important word (feature word) group and generates display data that presents it together with the search result, and the user interface unit 10 displays this data (X04). The user then performs a refinement search, for example by operating the refinement search button B (X05). Steps X02 to X05 can be repeated. A button for canceling the refinement may also be provided so that the user can return to the immediately preceding state. In this way, the important words used for refinement may be selected again.
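The selection of the important word group in step X03 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `extract_feature_words`, the in-memory dictionary standing in for the related document search index, and the choice to score a word by summing its per-document weights are all hypothetical. Only the two exclusion rules and the top-N cut come from the description above.

```python
from collections import defaultdict


def extract_feature_words(result_docs, query_terms, top_n=5):
    """Pick top-N feature words for a set of search result documents.

    result_docs: dict mapping doc_id -> {word: weight}, a stand-in for the
        related document search index (weights could be tf*idf values).
    query_terms: search strings from the original condition A.
    Scoring by summed weight is an assumption made for this sketch.
    """
    scores = defaultdict(float)
    doc_counts = defaultdict(int)
    for weights in result_docs.values():
        for word, weight in weights.items():
            scores[word] += weight
            doc_counts[word] += 1

    n_docs = len(result_docs)
    candidates = []
    for word, score in scores.items():
        # Rule 1: drop words that contain a search string of condition A.
        if any(term in word for term in query_terms):
            continue
        # Rule 2: drop words present in every result document; they cannot
        # narrow the result (over the result set, log(N/df) would be zero).
        if doc_counts[word] == n_docs:
            continue
        candidates.append((word, score))

    # Highest score first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:top_n]]
```

Calling the function with the word weights of the documents returned in step X02 yields the list that the display synthesis unit would show next to the search result.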

Because the important words are keywords, the refinement condition typically targets full-text search. However, where an important word can be treated as an attribute, the refinement condition may instead be an attribute search.

According to this embodiment, words that represent the characteristics of the search result can be obtained simply by referring to the related document search index, so refinement can be performed easily and accurately.

That is, in this embodiment, when a document search is performed under a given search condition, a plurality of search condition candidates that are effective for further refining the search result can be presented together with the search result.

Moreover, using a system that holds an index, such as a related document search component, the embodiment identifies words that are used characteristically in the search result document group and creates, for each such word, a refined search condition expression in which the word is added to the original search condition as an AND condition. Search condition candidates that are effective for refinement can thus be generated automatically.
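The automatic generation of refined search condition expressions can be illustrated with a short sketch. The query syntax (parentheses, the `AND` keyword, double-quoted terms) and the function name `build_refinement_candidates` are assumptions made for illustration; the description does not specify a concrete query language.

```python
def build_refinement_candidates(original_condition, feature_words):
    """Return one refined condition per feature word.

    Each candidate ANDs a single feature word onto the original search
    condition. The textual query syntax used here is hypothetical.
    """
    return [
        f'({original_condition}) AND "{word}"'
        for word in feature_words
    ]
```

Generating one candidate per word, rather than a single combined expression, matches the idea of presenting several alternative refinement conditions from which the user selects one.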

That is, in the system of the embodiment, the words appearing in each document and their frequency information are obtained from the documents, and the words characteristic of the search result document group can be obtained in order of score.

Because index data for attribute search, document content search, and related document search are all available for the documents held in the document management system, applying the related document search index data to the result of an attribute search or a full-text search makes it easy to obtain search word candidates that are effective for refinement by full-text search.

The present invention is not limited to the embodiment described above, and various modifications are possible. For example, in the above example the words for refining the search result are selected using the related document search index; however, information such as the frequency (importance) of each word may instead be held for each document independently of the related document search index, and this information may be used to select the words that are characteristic of the search result document group.

Important attribute values may also be extracted from the attribute information of the search result document group (FIG. 8) and used as important words to form a refinement search condition by attribute search. In this case too, attribute values shared by all of the search result documents are excluded from the important words. In this case, the underlying search (the search before refinement) may be a related document search.
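The attribute-based variant can be sketched in the same way. The attribute names in the usage example, the function name `extract_important_attribute_values`, and the use of simple document counts as the measure of importance are assumptions; the description only states that important attribute values are extracted and that values shared by all result documents are excluded.

```python
from collections import Counter


def extract_important_attribute_values(result_attrs, top_n=3):
    """Pick attribute values usable as refinement conditions.

    result_attrs: dict mapping doc_id -> {attribute_name: value}, a
        stand-in for the document attribute records.
    Ranking by how many result documents carry a value is an assumption.
    """
    n_docs = len(result_attrs)
    counts = Counter()
    for attrs in result_attrs.values():
        for name, value in attrs.items():
            counts[(name, value)] += 1

    # Values carried by every result document cannot narrow the result.
    candidates = [(pair, c) for pair, c in counts.items() if c < n_docs]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in candidates[:top_n]]
```

Each returned `(attribute_name, value)` pair could then be added to the original condition as an attribute search term.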

As shown in FIG. 10, a narrowing rate may also be calculated and displayed for each important word (feature word). The narrowing rate may be calculated by the important word extraction unit 15 or by a separately provided functional block (not shown). The narrowing rate is calculated, for example, by referring to the tf of the search result documents. The tf values may be stored, together with IDF values and the like, as management data alongside the registered documents in the registered document group storage unit 12, for example, or may be stored in the related document search index storage unit 25 as auxiliary information for the related document search index.
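One way to compute such a narrowing rate from tf values is sketched below. It assumes that the rate of a word is the fraction of search result documents whose tf for that word is greater than zero, which matches the ratio referred to in the claims; the exact formula of the embodiment is not given in the description, and the function name `narrowing_rates` is hypothetical.

```python
def narrowing_rates(result_tf, feature_words):
    """Compute, per feature word, the share of result documents containing it.

    result_tf: dict mapping doc_id -> {word: term frequency}.
    Returns a dict mapping word -> rate in the range 0.0 to 1.0.
    Defining the rate as "documents with tf > 0 / all result documents"
    is an assumption made for this sketch.
    """
    total = len(result_tf)
    if total == 0:
        return {word: 0.0 for word in feature_words}

    rates = {}
    for word in feature_words:
        hits = sum(1 for tf in result_tf.values() if tf.get(word, 0) > 0)
        rates[word] = hits / total
    return rates
```

A rate of 0.25, for example, would indicate that refining with that word keeps roughly a quarter of the current search result.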

FIG. 1 is a block diagram showing the overall configuration of an embodiment of the present invention.
FIG. 2 is a diagram explaining an operation example of the embodiment.
FIG. 3 is a diagram explaining a display example of the search result (refinement search) of the embodiment.
FIG. 4 is a diagram explaining an example of the search condition input form of the embodiment.
FIG. 5 is a diagram explaining an example of a record of the full-text search index of the embodiment.
FIG. 6 is a diagram explaining an example of a record of the attribute search index of the embodiment.
FIG. 7 is a diagram explaining an example of a record of the related document search index of the embodiment.
FIG. 8 is a diagram explaining an example of a document attribute record of the embodiment.
FIG. 9 is a diagram explaining an example of a record of another full-text search index of the embodiment.
FIG. 10 is a diagram explaining a modification of the embodiment.

Explanation of symbols

10 User interface unit
11 Attribute management unit
11 Document management unit
12 Registered document group storage unit
13 Document search unit
14 Search result acquisition unit
15 Important word extraction unit
16 Display synthesis unit
17 Index generation unit
20 Full-text search execution unit
21 Full-text search index storage unit
22 Attribute search execution unit
23 Attribute search index storage unit
24 Related document search execution unit
25 Related document search index storage unit
100 Document management system
101 Computer
102 Recording medium

Claims

1. A computer program for document search, used to realize:
search means for searching a document group;
important phrase information storage means for storing important phrase information for each document;
narrowing word determination means for acquiring, with reference to the important phrase information storage means, the important phrase information of documents included in a search result of the search means, and determining a narrowing word based on the acquired important phrase information; and
refinement search instruction means for instructing the search means to perform a refinement search of the search result using the narrowing word.

2. The computer program for document search according to claim 1, wherein the search means performs a full-text search with reference to index data based on n-grams.

3. The computer program for document search according to claim 1, wherein the search means performs a full-text search with reference to index data prepared for each word.

4. The computer program for document search according to claim 1, wherein the search means performs an attribute search with reference to index data prepared based on the attributes of the documents.

5. The computer program for searching a document according to claim 1, wherein the important phrase information is a set of weights of words included in a predetermined word set prepared for each document.

6. The computer program for searching a document according to claim 5, wherein the weight of each word for each document is calculated based on the frequency of each word in the document and the number of documents in which the word appears.

7. The computer program for document search, further used to realize related document search means for searching for documents related to a predetermined word with reference to the important phrase information for each document stored in the important phrase information storage means.

8. The computer program for document search, wherein the narrowing word determination means excludes a narrowing word candidate from the narrowing words when the candidate is included in the search result documents at a predetermined ratio or more.

9. The computer program for document search according to any one of claims 1 to 8, wherein, for each narrowing word, the ratio of search result documents that include the narrowing word is displayed.

10. A computer program for document search, used to realize:
search means for searching a document group;
important phrase information storage means for storing important phrase information for each document;
narrowing word determination means for determining, with reference to the important phrase information storage means, important phrases of documents included in a search result of the search means;
search result / important phrase display means for displaying the search result and the important phrases; and
refinement search instruction means for instructing, in response to a user operation on a displayed important phrase, execution of a refinement search of the search result using the important phrase that was operated on.

11. A document search apparatus comprising:
search means for searching a document group;
important phrase information storage means for storing important phrase information for each document;
narrowing word determination means for acquiring, with reference to the important phrase information storage means, the important phrase information of documents included in a search result of the search means, and determining a narrowing word based on the acquired important phrase information; and
refinement search instruction means for instructing the search means to perform a refinement search of the search result using the narrowing word.

12. A document search apparatus comprising:
search means for searching a document group;
important phrase information storage means for storing important phrase information for each document;
means for determining, with reference to the important phrase information storage means, important phrases of documents included in a search result of the search means;
search result / important phrase display means for displaying the search result and the important phrases; and
refinement search instruction means for instructing, in response to a user operation on a displayed important phrase, execution of a refinement search of the search result using the important phrase that was operated on.

13. A document search method comprising:
a search step in which search means searches a document group;
a narrowing word determination step in which narrowing word determination means refers to important phrase information storage means that stores important phrase information for each document, acquires the important phrase information of documents included in a search result of the search means, and determines a narrowing word based on the acquired important phrase information; and
a refinement search instruction step in which refinement search instruction means instructs the search means to perform a refinement search of the search result using the narrowing word.

14. A document search method comprising:
a search step in which search means searches a document group;
a narrowing word determination step in which narrowing word determination means refers to important phrase information storage means that stores important phrase information for each document and determines important phrases of documents included in a search result of the search means;
a display step in which search result / important phrase display means displays the search result and the important phrases; and
a refinement search instruction step in which refinement search instruction means instructs, in response to a user operation on a displayed important phrase, execution of a refinement search of the search result using the important phrase that was operated on.