JP2008181566A - Document retrieval device and document retrieval method

Info

Publication number: JP2008181566A
Application number: JP2008109517A
Authority: JP
Inventors: Masumi Narita; 真澄成田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2008-04-18
Filing date: 2008-04-18
Publication date: 2008-08-07
Anticipated expiration: 2020-11-06
Also published as: JP4373478B2

Abstract

PROBLEM TO BE SOLVED: To obtain a high-precision retrieval result by weighting words that are extracted, with duplicates eliminated, from a plurality of sections of a retrieval request input from a retrieval request inputting means more heavily than words extracted from only one of the sections.

SOLUTION: A document retrieval device includes a retrieval request inputting means 11, a retrieval condition generating means 13 and a document retrieving means 14, and extracts, from a plurality of documents to be retrieved in a document database 15, a document that meets a retrieval request composed of multiple sections, at least one of which is a sentence. The retrieval condition generating means 13 generates retrieval conditions by weighting words that are extracted, with duplicates eliminated, from the plurality of sections of the retrieval request input from the retrieval request inputting means 11 more heavily than words extracted from only one of the sections. The document retrieving means 14 retrieves, from the plurality of documents to be retrieved, a document that meets the retrieval conditions generated by the retrieval condition generating means 13.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a document search apparatus and a document search method, and more particularly to a document search apparatus and a document search method for retrieving documents that match a search request from digitized document information.

A document search apparatus is used to retrieve a specific document from a document database that stores a plurality of pieces of document information. Such an apparatus extracts from the document database the document information that matches an input search request. In general, the content of an input search request cannot be used as a search condition as it is, and the search condition used for the actual search is in many cases generated by the document search apparatus.

For example, since a search request input by a user often contains words and phrases that are unnecessary for the search, it is common practice to subject the input search request to language analysis and remove those unnecessary words and phrases. Furthermore, many document search apparatuses use not only words but also phrases made up of a plurality of words as the elements from which search conditions are generated.

Patent Document 1 discloses a method for generating a search condition by performing language analysis on a given search request. In this method, morphological analysis is applied to the input search request sentence to identify each word in it, and synonymous expressions of the search request are generated by expanding the identified words into their inflected forms or by decomposing compound words. The search request words and their synonymous expressions are then matched against the document database to perform the search, so that documents matching the user's search intention are retrieved.

Non-Patent Document 1 proposes, as an effective way of using phrases when phrases are used as search terms in addition to words, a technique in which an initial search is performed with search terms consisting only of words and the retrieved documents are then re-ranked using the phrases.

Methods have also been proposed for generating search conditions when the input search request is composed of a plurality of section descriptions. For example, in TREC (Text REtrieval Conference), an international competition for information retrieval, a search request is given as three section descriptions, <title>, <desc>, and <narr>, as shown in the example of Table 1. Using a plurality of section descriptions in the actual search process is not mandatory, but doing so has the advantage of increasing the amount of information available for the search, so search conditions are often generated by combining the above section descriptions as appropriate.

When a search condition is generated from a search request composed of a plurality of section descriptions as described above, the weight given to a term is changed according to the section from which the term was extracted. Using the above example, terms extracted from the <title> section are given the highest weight, terms extracted from the <desc> section are given the next highest weight, and terms extracted from the <narr> section are given the lowest weight of the three sections (Non-Patent Document 2).

As for techniques in which terms of high importance among those used in the search condition are designated before the search is executed, Patent Document 2 discloses a method in which the user visually highlights search character strings according to their importance. In this method, the degree of emphasis of a search string is mapped to its importance, this importance is used to determine the relevance between the search string and each retrieved document, and documents with high relevance are preferentially returned as search results.

Japanese Unexamined Patent Publication No. 6-75996
Japanese Unexamined Patent Publication No. 9-153061
Japanese Unexamined Patent Publication No. 6-5996
M. Mitra, C. Buckley, A. Singhal, and C. Cardie. 1997. “An analysis of statistical and syntactic phrases.” Proceedings of the Fifth RIAO Conference.
cf. S. E. Robertson, S. Walker, and M. Beaulieu. 1999. “Okapi at trec-7.” Proceedings of the Seventh Text REtrieval Conference.

However, the conventional document search apparatuses described above have the following problems. In the invention described in Patent Document 3, the search condition is set with the same weight for a search term and its synonymous expressions, so little information is available for calculating the relevance of candidate documents, and a highly accurate search result cannot be obtained. Because the words and phrases used as search terms are all processed at the same level, noise is likely to appear in the search results.

The technique of re-ranking, by means of phrases, the results of an initial search whose search condition consists only of words has been reported experimentally to work well when the recall of the search results is low. However, phrases, which are considered to carry more condensed semantic information than words, should also be effective in the initial search. In other words, the conventional technique has problems with how phrases are expressed and weighted as search terms.

Furthermore, when the input search request is composed of a plurality of section descriptions, the conventional technique of changing the weight of a search term on the basis of section information has two problems: (1) no weighting is considered for terms that appear in common in a plurality of sections, and (2) words and noun phrases that appear in common in a plurality of sections cannot be weighted differently from each other.

To solve these problems, on the assumption that more important terms tend to be used in common across a plurality of sections, it is necessary, for example, to raise the weight of terms common to a plurality of section descriptions and, even then, to adjust the rate at which the weight is raised separately for common words and common noun phrases.

In Patent Document 2, the user judges the importance of each search term personally and designates it directly by highlighting the desired search terms. This not only makes input laborious, but also has the problem that importance is difficult to judge intuitively.

The present invention has been made in view of the above problems. Its object is to obtain a highly accurate search result by linguistically analyzing a search request that consists of a plurality of section descriptions input in natural language, extracting words and noun phrases suitable for the search, adjusting the weighting between words and noun phrases appropriately, and at the same time raising the weight of words and noun phrases common to a plurality of sections.

Furthermore, two kinds of notation are given to each noun phrase: one notation is used for the initial search and the other is used for re-ranking the documents in the initial search result, so that a highly accurate search result is obtained.

The invention of claim 1 is a document search apparatus that extracts, from a plurality of documents to be searched, a document that matches a search request composed of a plurality of sections, at least one of which is a sentence, the apparatus comprising: search request input means for inputting the search request; search condition generation means for generating a search condition by making the weight of words that are extracted from a plurality of the sections of the search request input by the search request input means, with duplicates removed, higher than the weight of words extracted from only one of the sections; and document search means for searching the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

The invention of claim 2 is a document search apparatus that extracts, from a plurality of documents to be searched, a document that matches a search request composed of at least one sentence and one keyword, the apparatus comprising: search request input means for inputting the search request; search condition generation means for generating a search condition by making the weight of words that are extracted in common from the sentence and the keyword of the search request input by the search request input means, with duplicates removed, higher than the weight of words extracted from only the sentence or only the keyword; and document search means for searching the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

The invention of claim 3 is a document search method for a document search apparatus that includes search request input means, search condition generation means, and document search means, and that extracts, from a plurality of documents to be searched, a document that matches a search request composed of a plurality of sections, at least one of which is a sentence, wherein: the search request input means inputs the search request; the search condition generation means generates a search condition by making the weight of words that are extracted from a plurality of the sections of the search request input by the search request input means, with duplicates removed, higher than the weight of words extracted from only one of the sections; and the document search means searches the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

The invention of claim 4 is a document search method for a document search apparatus that includes search request input means, search condition generation means, and document search means, and that extracts, from a plurality of documents to be searched, a document that matches a search request composed of at least one sentence and one keyword, wherein: the search request input means inputs the search request; the search condition generation means generates a search condition by making the weight of words that are extracted in common from the sentence and the keyword of the search request input by the search request input means, with duplicates removed, higher than the weight of words extracted from only the sentence or only the keyword; and the document search means searches the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

According to the present invention, the weight of words that are extracted from a plurality of sections of the search request input by the search request input means, with duplicates removed, is made higher than the weight of words extracted from only one of the sections, so that search accuracy can be improved.

Further, according to the present invention, the weight of words that are extracted in common from the sentence and the keyword of the search request input by the search request input means, with duplicates removed, is made higher than the weight of words extracted from only the sentence or only the keyword, so that search accuracy can be improved.

FIG. 1 is a block diagram illustrating an example of a document search apparatus to which the present invention is applied. The document search apparatus according to the embodiment of the present invention comprises an input unit 1, a display unit 2, an arithmetic processing unit 3 that includes a central processing unit (CPU), a memory unit 4, an information storage unit 5, and a data bus 6 that connects them. The input unit 1 consists of a keyboard, a mouse, a touch panel, or the like, and is used by the user to input information to the document search apparatus. The display unit 2 consists of a CRT display, a liquid crystal display, or the like, and displays to the user the information obtained by the document search apparatus as well as the information input from the input unit 1. The arithmetic processing unit 3 performs document search processing in accordance with a predetermined program. The memory unit 4 consists of a ROM that stores the program executed by the arithmetic processing unit 3 and a RAM that temporarily stores information needed while the arithmetic processing unit 3 operates. The information storage unit 5 consists of a relatively large-capacity storage device such as a hard disk drive, and stores programs and the document database in which the documents to be searched are registered.

FIG. 2 is a functional block diagram of the document search apparatus shown in FIG. 1; the arrows in FIG. 2 indicate the flow of processing within the apparatus. The search request input means 11 is the function by which the user inputs natural language describing the content of the documents to be retrieved, and corresponds to the function of the input unit 1. Here, natural language means a language such as Japanese, English, German, or French, and the documents to be searched are also assumed to be written in natural language. The information describing the content of the documents that the user wants to retrieve is referred to as a search request. The search request is given by the user as a word, a group of words, or a sentence expressing the user's search intention. The search request input by the search request input means 11 is supplied to the language analysis means 12.

The language analysis means 12 is realized by the arithmetic processing unit (CPU) 3 executing a predetermined program. The language analysis means 12 performs morphological analysis on the search request to recognize each word in it, and extracts from the recognized words those suitable for the search condition. The language analysis means 12 also extracts groups of words that can be grouped as noun phrases, using phrase segmentation rules based on the part-of-speech information of the words. This process is called noun phrase segmentation and is well known in the field of language analysis, so its description is omitted. Morphological analysis is likewise well known in the field of language analysis, and its description is also omitted.

The search condition generation means 13 receives the processing result of the language analysis means 12 and generates a search condition by combining the extracted words and noun phrases with appropriate operators. The search condition generation means 13 is realized by the arithmetic processing unit (CPU) 3 executing a predetermined program. The search condition generated by the search condition generation means 13 is supplied to the document search means 14.

The document search means 14 searches the document information registered in the document database 15 and extracts the document information that matches the supplied search condition. The document search means 14 is realized by the arithmetic processing unit (CPU) 3 executing a predetermined program. The document information extracted by the document search means 14 is supplied to the search result display means 16.

The search result display means 16 corresponds to the function of the display unit 2 and displays the document information extracted as the search result. The user who input the search request can thus check the search result on the display screen. The search result may also be printed by providing the display unit 2 with a printer.

Next, the processing result of the language analysis means 12 is described. Since the language analysis means 12 uses conventional language analysis techniques, only the processing result is described. The user operates the input unit 1 (keyboard) to input a search request. FIG. 3 shows an example of the screen when the user inputs a search request through the search request input means 11. In this example, as shown in FIG. 3, the search request (condition) input by the user consists of two sections: a keyword description and a request sentence description. Assume that the user has input the phrase “cigar smoking” in the keyword description and the English sentence “Find documents that discuss the popularity of cigar smoking.” in the request sentence description. On the screen of FIG. 3, “Initial number” specifies that 30 documents matching the search condition are to be displayed as the search result.

The language analysis means 12 processes the terms input in the keyword description section and the terms input in the request sentence description section independently. Since a search request contains words that are unnecessary for the search, such as articles, prepositions, and conjunctions, the language analysis means 12 removes these unnecessary terms from the input and extracts only the terms needed for the search. Unnecessary words are removed by referring to an unnecessary word list prepared in advance.

The unnecessary word list contains function words such as articles, prepositions, and conjunctions, as well as content words considered unrelated to the user's search intention. The language analysis means 12 checks each word of the search request against the unnecessary word list and removes the words registered in it. For the noun phrases obtained by the noun phrase segmentation of the language analysis means 12, on the other hand, the words making up each noun phrase are checked against the unnecessary word list, and only the function words registered in the list are removed from the noun phrase. For example, the function word “the”, which is in the unnecessary word list, is removed from the noun phrase “the World Court” identified by the language analysis means 12.
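The two filtering rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: the split of the unnecessary word list into `FUNCTION_WORDS` and `CONTENT_STOP_WORDS`, and the function names, are hypothetical, and the input words are assumed to have already been reduced to base form by morphological analysis.

```python
# Hypothetical split of the unnecessary word list into its two kinds of entries.
FUNCTION_WORDS = {"the", "of", "that"}
CONTENT_STOP_WORDS = {"find", "document", "discuss"}


def filter_words(words):
    """Drop every word registered in the list, function word or content word."""
    stop = FUNCTION_WORDS | CONTENT_STOP_WORDS
    return [w for w in words if w.lower() not in stop]


def filter_noun_phrase(phrase_words):
    """Drop only function words from a noun phrase; content words are kept."""
    return [w for w in phrase_words if w.lower() not in FUNCTION_WORDS]


# Words of the request sentence, assumed already lemmatized ("documents" -> "document").
request = ["find", "document", "that", "discuss", "the",
           "popularity", "of", "cigar", "smoking"]
kept_words = filter_words(request)
kept_phrase = filter_noun_phrase(["the", "World", "Court"])
```

The asymmetry is the point of the sketch: a content word that is on the list is removed when it stands alone but would be kept inside a noun phrase.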

The processing result of the language analysis means 12 for the search request shown in FIG. 3 is as follows. From the keyword description, the two words “cigar” and “smoking” and the noun phrase “cigar smoking” are extracted. The request sentence description consists of nine words, but only the appropriate words are extracted by checking them against the unnecessary word list prepared in advance. In the present embodiment, “find”, “document”, “that”, “discuss”, “the”, and “of” are assumed to be registered in the unnecessary word list. Checking against this list therefore leaves the three words “popularity”, “cigar”, and “smoking”, and “cigar smoking” is extracted as a noun phrase.

<Terms extracted from the keyword description>
Words: cigar, smoking  Noun phrase: cigar smoking
<Terms extracted from the request sentence description>
Words: popularity, cigar, smoking  Noun phrase: cigar smoking

The words and noun phrases extracted from each section description in this way become the elements from which the search condition is generated. The search condition generation means 13 compares the words and noun phrases extracted from the two sections, removes the words and noun phrases duplicated between the sections, assigns appropriate weights to the words and to the noun phrases, and generates the search condition by combining them with operators. Logical operators such as AND and OR are used. In addition, WINDOW is used as a proximity operator and SCALE is used as a weighting operator.

The operator AND specifies that a document is extracted as a search result when it contains all of the words combined by this operator. The operator OR specifies that a document is extracted as a search result when it contains any one of the words combined by this operator.
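The semantics of the two logical operators can be illustrated with a short sketch. It assumes a document is represented as a list of lowercase tokens; the function names `matches_and` and `matches_or` are hypothetical and are not part of the operator notation used in this description.

```python
def matches_and(doc_tokens, terms):
    """AND: the document must contain every one of the combined terms."""
    present = set(doc_tokens)
    return all(term in present for term in terms)


def matches_or(doc_tokens, terms):
    """OR: the document must contain at least one of the combined terms."""
    present = set(doc_tokens)
    return any(term in present for term in terms)


doc = "cigar smoking is gaining popularity".split()
```

With this document, `matches_and(doc, ["cigar", "pipe"])` is false because "pipe" is missing, while `matches_or(doc, ["cigar", "pipe"])` is true.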

The operator WINDOW is an operator introduced to handle noun phrases; it specifies the distance between, and the order of, the two words it combines. It is written in a form such as #window[1,1,o]. The first and second numbers in the brackets define the range within which the words must appear, and the third character indicates the word order: “o” specifies that the two words appear in the order written. In the above example, therefore, the two words are required to appear adjacent to each other in the order written.
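One possible reading of the WINDOW operator is sketched below. The interpretation of the two numbers as a minimum and maximum distance in token positions is an assumption made for illustration, as are the function name `matches_window` and its parameters; the description itself only states that [1,1,o] means adjacent and in the written order.

```python
def matches_window(tokens, first, second, min_gap=1, max_gap=1, ordered=True):
    """Return True if `first` and `second` occur within the allowed distance.

    The gap is the difference in token positions, so min_gap = max_gap = 1
    with ordered=True means `second` immediately follows `first`.
    """
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i  # positive when `second` comes after `first`
            if not ordered:
                gap = abs(gap)
            if min_gap <= gap <= max_gap:
                return True
    return False


sentence = "find documents that discuss the popularity of cigar smoking".split()
```

Under this reading, #window[1,1,o](cigar,smoking) corresponds to `matches_window(sentence, "cigar", "smoking")`, which holds for the request sentence above but not for a text containing only "smoking cigar".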

The operator SCALE is an operator for adjusting the weighting between search conditions in units of words and search conditions in units of noun phrases. For example, the notation #scale[0.5] indicates that the weight of the search condition that follows it is 0.5. In the present embodiment, the accuracy of the search result is improved by giving words and noun phrases different weights.

As a result of various trials, the present inventor has found that search accuracy improves when the weight of search conditions in units of noun phrases is made smaller than the weight of search conditions in units of words. In the present embodiment, the weight of each word-unit search condition is 1, and the weight of each noun-phrase-unit search condition is 0.5.

Using the operators described above, the search condition generated from the above search request in the present embodiment is as follows.
#or(cigar,smoking,popularity,#scale[0.5](#window[1,1,o](cigar,smoking)))
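How the extracted terms might be merged across the two sections and rendered in this operator notation can be sketched as follows. The helper names `merge_sections` and `build_query` are hypothetical, noun phrases are assumed to be two-word pairs, and the 0.5 phrase weight is the value given in the embodiment.

```python
def merge_sections(*sections):
    """Concatenate per-section term lists, keeping only the first occurrence."""
    seen, merged = set(), []
    for section in sections:
        for term in section:
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


def build_query(words, phrases, phrase_weight=0.5):
    """Render words and two-word noun phrases as a single #or condition."""
    parts = list(words)
    for first, second in phrases:
        parts.append(f"#scale[{phrase_weight}](#window[1,1,o]({first},{second}))")
    return "#or(" + ",".join(parts) + ")"


# Terms extracted from the keyword description and the request sentence description.
keyword_words, keyword_phrases = ["cigar", "smoking"], [("cigar", "smoking")]
request_words, request_phrases = ["popularity", "cigar", "smoking"], [("cigar", "smoking")]

words = merge_sections(keyword_words, request_words)
phrases = merge_sections(keyword_phrases, request_phrases)
query = build_query(words, phrases)
```

For the cigar smoking example, `query` is the same string as the search condition shown above.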

When the search condition generation means 13 has generated a search condition such as the above, the document search means 14 extracts the documents that match the search condition from among the documents registered in the document database 15. At this time, the scores obtained for the individual documents by taking the weights of the search condition into account are compared, and documents with high scores are extracted as documents that match the search condition. This uses a well-known document search process, and its description is omitted.
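Since the scoring itself is left to well-known techniques, the following is only an illustrative stand-in, not the scoring used by the apparatus: a simple additive score in which each matched word contributes its weight of 1 and each matched adjacent noun phrase contributes 0.5. The function names, the additive formula, and the sample documents are all assumptions.

```python
def score(tokens, words, phrases, word_weight=1.0, phrase_weight=0.5):
    """Sum the weights of the words and adjacent two-word phrases found in tokens."""
    present = set(tokens)
    total = sum(word_weight for w in words if w in present)
    for first, second in phrases:
        # A phrase matches when its two words are adjacent in the written order.
        if any(a == first and b == second for a, b in zip(tokens, tokens[1:])):
            total += phrase_weight
    return total


def rank(docs, words, phrases, limit=30):
    """Return the ids of matching documents, highest score first, up to `limit`."""
    scored = [(score(text.lower().split(), words, phrases), doc_id)
              for doc_id, text in docs.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for _, d in scored[:limit]]


docs = {
    "d1": "Cigar smoking gains popularity",
    "d2": "Smoking ban announced",
    "d3": "Weather report",
}
ranking = rank(docs, ["cigar", "smoking", "popularity"], [("cigar", "smoking")])
```

A practical system would more likely use a term-frequency-based ranking function; the additive form is chosen here only because it makes the effect of the word and phrase weights easy to see.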

When the document search process ends, the search result display means 16 displays the search results on the screen in descending order of score, as shown in FIG. 4. Because an initial number of 30 has been specified, the 30 highest-scoring documents are displayed. The 30 documents extracted as search results can be browsed by scrolling the screen of FIG. 4.

Next, another embodiment of the present invention will be described. The overall functional configuration is the same as in FIG. 2. The differences are that, when the search condition generation means 13 compares the words and noun phrases extracted from the two sections by the language analysis means 12, words and noun phrases common to both sections are weighted more heavily than words and noun phrases that appear in only one section description, and that the rate of increase differs between shared words and shared noun phrases. Words and noun phrases common to the sections are regarded as highly important to the search request, so weighting them more heavily than other terms can be expected to improve search accuracy. Various trials also showed that, between shared words and shared noun phrases, giving the larger weight to the words improves search accuracy. To allow this weight adjustment, a new operator called LEVEL is introduced.

Taking the search request shown in FIG. 3 as an example, in the present embodiment the weight for words common to the sections is set to 3 and the weight for noun phrases common to the sections is set to 1.5, as shown below.
#or(#level[3](#or(cigar,smoking)),popularity,#level[1.5](#scale[0.5](#window[1,1,o](cigar,smoking))))
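The following Python sketch illustrates one way the LEVEL weighting could be derived from two section descriptions. It is an assumption-laden example rather than the disclosed implementation: the helper names (split_shared, build_leveled_condition) are hypothetical, and the per-section word and noun-phrase lists are assumed inputs chosen so that "cigar" and "smoking" occur in both sections while "popularity" occurs in only one.

```python
def window(words, min_dist=1, max_dist=1, ordered=True):
    """Render a #window condition; 'o' = ordered, 'u' = unordered."""
    flag = "o" if ordered else "u"
    return f"#window[{min_dist},{max_dist},{flag}]({','.join(words)})"


def split_shared(sections):
    """Return (items found in 2+ sections, items found in one section), duplicates removed."""
    unique = list(dict.fromkeys(item for section in sections for item in section))
    shared = [x for x in unique if sum(x in section for section in sections) >= 2]
    single = [x for x in unique if x not in shared]
    return shared, single


def build_leveled_condition(section_words, section_phrases,
                            word_level=3, phrase_level=1.5, phrase_scale=0.5):
    shared_words, single_words = split_shared(section_words)
    shared_phrases, single_phrases = split_shared(section_phrases)
    parts = []
    if shared_words:
        # Words common to the sections are raised with #level.
        parts.append(f"#level[{word_level}](#or({','.join(shared_words)}))")
    parts.extend(single_words)
    for phrase in shared_phrases:
        # Shared noun phrases are raised too, but by a smaller factor.
        parts.append(f"#level[{phrase_level}](#scale[{phrase_scale}]({window(phrase)}))")
    for phrase in single_phrases:
        parts.append(f"#scale[{phrase_scale}]({window(phrase)})")
    return f"#or({','.join(parts)})"


# Assumed analysis output for two sections of one search request.
section_words = [["cigar", "smoking"], ["popularity", "cigar", "smoking"]]
section_phrases = [[("cigar", "smoking")], [("cigar", "smoking")]]
leveled_query = build_leveled_condition(section_words, section_phrases)
print(leveled_query)
```

Duplicates are removed before the shared and single-section sets are formed, so a word that occurs in both sections contributes one term to the condition.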

Next, still another embodiment of the present invention will be described. FIG. 5 shows the overall functional configuration. The search request input means 21 and the language analysis means 22 have the same functions as the search request input means 11 and the language analysis means 12 in FIG. 2. The first difference is that the search condition generation means 23 generates two search conditions for each noun phrase, one requiring the words that make up the noun phrase to appear adjacent to each other and one allowing them to appear apart within a certain distance, and gives each condition its own notation. The second difference is that the initial document search means 24 performs an initial search of the document database 25 using the latter noun-phrase condition, the initial search result storage means 26 temporarily stores the results as the initial search result document database 27, and the document reranking means 28 then reorders these initial search result documents using the former noun-phrase condition.

The latter search condition is newly generated for noun phrases in order to reduce search omissions, taking into account that the words making up a noun phrase extracted by the language analysis means 22 are likely to co-occur relatively close together in documents that match the search request. Furthermore, the initial search with this condition extracts many documents likely to relate to the user's search intent, and the relevance of those documents is then recalculated with the former, more restrictive condition to reorder them, which improves search accuracy.

For example, for the noun phrase “cigar smoking” extracted from the above search request by the language analysis means 22, two search conditions are generated: the noun phrase's original condition, in which the words “cigar” and “smoking” appear adjacently in this order, and a condition that allows these words to appear apart within a certain distance. In the present embodiment, the distance within which the words of a noun phrase may appear apart, in any order, is set to 30 words on the assumption that they fall within the same sentence.

As a result, for the search request of FIG. 3, the search condition used for the initial search and the search condition used for reordering the documents are as follows. In the embodiment of the invention according to claim 2, the following search conditions are additionally weighted by the LEVEL operator.

<a. Search condition used for the initial search>
#or(cigar,smoking,popularity,#scale[0.5](#window[1,30,u](cigar,smoking)))
* #window[1,30,u](cigar,smoking) specifies that “cigar” and “smoking” appear within a range of 1 to 30 words of each other, in any order.

<b. Search condition used for reordering the initial search results>
#or(cigar,smoking,popularity,#scale[0.5](#window[1,1,o](cigar,smoking)))
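To make the difference between the two window conditions concrete, the following Python sketch checks whether two words satisfy a window condition in a tokenized text. The function name window_match is hypothetical, and measuring distance as the difference between token positions (so that adjacent words have distance 1) is an assumption about how the window operator counts words.

```python
def window_match(tokens, first, second, min_dist, max_dist, ordered):
    """Return True if 'first' and 'second' occur min_dist..max_dist tokens apart.

    ordered=True  -> 'second' must follow 'first' (the 'o' flag)
    ordered=False -> either order is accepted (the 'u' flag)
    """
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            gap = j - i
            if not ordered:
                gap = abs(gap)
            if min_dist <= gap <= max_dist:
                return True
    return False


tokens = "the popularity of smoking a cigar has grown".split()

# Condition b: "cigar" immediately followed by "smoking".
strict = window_match(tokens, "cigar", "smoking", 1, 1, ordered=True)
# Condition a: both words within 30 words of each other, in any order.
loose = window_match(tokens, "cigar", "smoking", 1, 30, ordered=False)
print(strict, loose)
```

In this sample text the loose condition is satisfied and the strict one is not, which is the kind of document the initial search is meant to keep and the reordering step is meant to rank lower.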

The initial document search means 24 uses search condition a, generated for the initial search, to calculate the relevance of each search target document, assigns each document a score, and extracts as the initial search result the number of highest-scoring documents given by the specified “initial number” (30 in the example of FIGS. 3 and 4). The relevance is calculated using, for example, the frequency with which the words and noun phrases that make up the search condition appear in the document, the number of documents in which these terms appear, and the weights assigned to them.
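The description does not give a relevance formula, so the following Python sketch substitutes a simple weighted TF-IDF sum purely for illustration. It combines the three factors named above (frequency in the document, number of documents containing the term, and the term's weight), but the specific formula, the function names score_documents and top_k, and the restriction to single-word terms are all assumptions.

```python
import math
from collections import Counter


def score_documents(docs, weighted_terms):
    """Score tokenized documents with an assumed weighted TF-IDF sum.

    docs: list of token lists
    weighted_terms: dict mapping term -> weight from the search condition
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents in which each term appears.
    doc_freq = {t: sum(1 for c in counts if c[t] > 0) for t in weighted_terms}
    scores = []
    for c in counts:
        score = 0.0
        for term, weight in weighted_terms.items():
            if doc_freq[term]:
                # weight * term frequency * assumed idf variant log(1 + N/df)
                score += weight * c[term] * math.log(1 + n_docs / doc_freq[term])
        scores.append(score)
    return scores


def top_k(docs, weighted_terms, k=30):
    """Return indices of the k highest-scoring documents (the 'initial number')."""
    scores = score_documents(docs, weighted_terms)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


docs = [
    ["cigar", "smoking", "cigar"],
    ["popularity"],
    ["weather"],
]
terms = {"cigar": 1.0, "smoking": 1.0, "popularity": 1.0}
scores = score_documents(docs, terms)
best = top_k(docs, terms, k=2)
print(best)
```

A noun-phrase term would be handled the same way once its occurrences in each document have been counted with the window condition, using the phrase weight of 0.5 in place of 1.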

Next, for the 30 documents extracted by the initial document search means 24, the document reranking means 28 recalculates the relevance to the search request using search condition b. This recalculation gives each of the 30 documents a new score, and the documents are reordered in descending order of the new scores.
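The two-stage flow described above (a loose initial search followed by reordering under the stricter condition) can be sketched in Python as follows. This is a toy illustration under stated assumptions: the scorer simply counts word occurrences and adds the phrase weight when the window condition holds, which stands in for the well-known relevance calculation, and the names near, make_scorer, and search_and_rerank are hypothetical.

```python
def near(tokens, a, b, max_dist, ordered):
    """True if a and b occur 1..max_dist tokens apart (b after a when ordered)."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            gap = j - i if ordered else abs(j - i)
            if 1 <= gap <= max_dist:
                return True
    return False


def make_scorer(words, phrase, max_dist, ordered, phrase_weight=0.5):
    """Build a toy scoring function for one search condition."""
    def score(doc):
        tokens = doc.lower().split()
        total = float(sum(tokens.count(w) for w in words))  # word weight 1
        if near(tokens, phrase[0], phrase[1], max_dist, ordered):
            total += phrase_weight
        return total
    return score


def search_and_rerank(docs, initial_score, rerank_score, initial_count=30):
    """Stage 1: keep the top documents under the loose condition.
    Stage 2: reorder only those documents under the strict condition."""
    candidates = sorted(docs, key=initial_score, reverse=True)[:initial_count]
    return sorted(candidates, key=rerank_score, reverse=True)


docs = [
    "smoking a cigar is back in fashion",
    "cigar smoking regains popularity",
    "weather report for tuesday",
]
words = ["cigar", "smoking", "popularity"]
condition_a = make_scorer(words, ("cigar", "smoking"), max_dist=30, ordered=False)
condition_b = make_scorer(words, ("cigar", "smoking"), max_dist=1, ordered=True)
results = search_and_rerank(docs, condition_a, condition_b, initial_count=2)
print(results)
```

Because the second stage only reorders the stored initial results, a document that fails the strict adjacency condition is ranked lower but is not dropped from the final result.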

As described above, in the present embodiment the language analysis means extracts the words and noun phrases that serve as elements of the search condition from a search request consisting of a plurality of section descriptions input by the user, and the search condition is generated by applying appropriate weights to these terms and combining them with operators, so search omissions can be reduced and search accuracy increased. In addition, terms used in common across several section descriptions, that is, words and noun phrases regarded as more important, are weighted more heavily, and even then the rate of increase differs between words and noun phrases, which further increases search accuracy. Furthermore, each noun phrase extracted by the language analysis means is given two notations as search conditions; an initial search is performed with one notation, and the retrieved documents are reordered with the other to obtain the final search result, which reduces search omissions while increasing search accuracy.

In the embodiments described above, documents in English were the search target; however, the document search according to the present invention can also be applied to documents in other languages, such as Japanese, German, and French.

FIG. 1 is a block diagram for explaining an example of a document search apparatus to which the present invention is applied.
FIG. 2 is a functional block diagram of the document search apparatus shown in FIG. 1.
FIG. 3 is a diagram showing an example of a search request input screen.
FIG. 4 is a diagram showing an example of a search result output screen.
FIG. 5 is a functional block diagram for explaining another embodiment of the present invention.

Explanation of symbols

1…input unit, 2…display unit, 3…arithmetic processing unit, 4…memory unit, 5…information storage unit, 6…data bus, 11…search input means, 12…language analysis means, 13…search condition generation means, 14…document search means, 15…document database, 16…search result display means, 21…search request input means, 22…language analysis means, 23…search condition generation means, 24…initial document search means, 25…document database, 26…initial search result storage means, 27…initial search result document database, 28…document reranking means, 29…final result display means.

Claims

A document search device that extracts, from a plurality of documents to be searched, a document that matches a search request composed of a plurality of sections, at least one of the sections being a sentence, the document search device comprising:
search request input means for inputting the search request;
search condition generation means for generating a search condition in which words extracted in common from the plurality of sections of the search request input by the search request input means, with duplicates removed, are weighted more heavily than words extracted from only one of the sections; and
document search means for searching the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

A document search device that extracts, from a plurality of documents to be searched, a document that matches a search request comprising at least one sentence and one keyword, the document search device comprising:
search request input means for inputting the search request;
search condition generation means for generating a search condition in which words extracted in common from the sentence and the keyword of the search request input by the search request input means, with duplicates removed, are weighted more heavily than words extracted from only one of the sentence and the keyword; and
document search means for searching the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

A document search method using a document search device that includes search request input means, search condition generation means, and document search means, and that extracts, from a plurality of documents to be searched, a document that matches a search request composed of a plurality of sections, at least one of the sections being a sentence, wherein:
the search request input means inputs the search request;
the search condition generation means generates a search condition in which words extracted in common from the plurality of sections of the search request input by the search request input means, with duplicates removed, are weighted more heavily than words extracted from only one of the sections; and
the document search means searches the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.

A document search method using a document search device that includes search request input means, search condition generation means, and document search means, and that extracts, from a plurality of documents to be searched, a document that matches a search request comprising at least one sentence and one keyword, wherein:
the search request input means inputs the search request;
the search condition generation means generates a search condition in which words extracted in common from the sentence and the keyword of the search request input by the search request input means, with duplicates removed, are weighted more heavily than words extracted from only one of the sentence and the keyword; and
the document search means searches the plurality of documents to be searched for a document that matches the search condition generated by the search condition generation means.