JP4009937B2

JP4009937B2 - Document search device, document search program, and medium storing document search program

Info

Publication number: JP4009937B2
Application number: JP2002004805A
Authority: JP
Inventors: 大二郎森; 正之杉崎; 聡哉栗島; 博人稲垣
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-01-11
Filing date: 2002-01-11
Publication date: 2007-11-21
Anticipated expiration: 2022-01-11
Also published as: JP2003208447A

Description

【０００１】
【発明の属する技術分野】
本発明は、大量の文書情報を蓄積し、入力された文字列（語句）を含む文書を検索して提示する文書検索技術に関する。
【０００２】
【従来の技術】
近年、インターネットの普及に代表される情報流通インフラの急速な整備に伴い、大量の文書情報（以下、単に文書）が流通するようになった。これらの文書を網羅的に収集し、検索要求として入力された語句を本文に含む文書を検索して提示するシステムが実現されている。特にＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ）上の文書を検索するＷｅｂサーチエンジンと呼ばれるサービスとして、ｇｏｏ（ｈｔｔｐ：／／ｗｗｗ．ｇｏｏ．ｎｅ．ｊｐ／）やｇｏｏｇｌｅ（ｈｔｔｐ：／／ｗｗｗ．ｇｏｏｇｌｅ．ｃｏｍ／）などが実現されている。
【０００３】
これらのＷｅｂサーチエンジンでは、入力された検索要求語句に対して、数千件〜数百万件という大量の文書が検索結果として得られる場合が少なくない。このような大量の検索結果文書の全てに利用者が目を通すことは事実上不可能であるから、検索要求語句により良く適合していると考えられる文書を選りすぐって提示したり、検索結果文書の集合を分析し、サイト単位や文書の主題単位に検索結果文書をグループ化して表示し、利用者が求める文書を的確に絞り込むことを支援していた。
【０００４】
自然言語では一つの語句が複数の意味を持つ、即ち多義性を持つことがしばしばあるため、ある語句によって検索を行った時、異なる語義でその語句が用いられている文書が検索結果として混在して出力されることがある。これらの文書の中から利用者が求める文書を的確に絞り込む手段として、文書の主題によって検索結果を分類し、検索結果をグループ化して提示する技術は有用である。
【０００５】
検索結果の分類を行う手法としては、ＴＦ×ＩＤＦ法等を用いて各文書から特徴的な単語（語句）を抽出し、これらの語句数分の次元のベクトル空間に文書を配置し、文書ベクトルのなす角の余弦によって文書間の類似度を定義して、類似度の高い文書をクラスタ分析を用いて分類する手法等が用いられる。
【０００６】
なお、ＴＦ×ＩＤＦ法の詳細については、例えば、G. Salton, "Automatic Text Processing" 1989, Addison-Wesley Publishing等に記載されているが、文書集合全体の中で出現頻度がより少ない語句を重要な語句とみなし、ある文書の中で出現頻度がより高い語句をその文書の特徴を良く表す言葉だとみなすという二つの原理に基づいて、文書と語句との適合度を算出する手法である。
【０００７】
【発明が解決しようとする課題】
しかしながら、検索結果文書の数が大量である場合、全ての文書に対して上記のようなクラスタ分析を行うと、検索応答時間が長くなってしまうという問題があった。
【０００８】
また、ＴＦ×ＩＤＦ法を用いて各文書の特徴量を算出する場合、各語句の重要度は静的で変化しないものとみなすことになるが、実際には、ごく一般的な語句が、例えば本や映画の表題に用いられたりすると、利用者にとって重要度の高い語句とみなされるようになったり、ごくありふれた姓と名の組み合わせが有名人の名前として認識され、利用者にとって重要度の高いフレーズとなるようなケースがしばしばあるにも拘わらず、これを反映して的確に分類を行うことができなかった。
【０００９】
本発明の目的は、検索応答時間を短くし、利用者にとっての語句の重要度の動的な変化に対応して的確な分類を可能とすることにある。
【００１０】
【課題を解決するための手段】
前記課題を解決するため、本発明では、予め蓄積した検索対象文書の集合から所望の語句を含む文書を検索し提示する文書検索装置であって、利用者によって入力された語句を検索要求語句として受け取る検索要求入力手段と、検索対象文書の集合から前記利用者によって入力された検索要求語句を含む文書を検索し出力する文書検索手段と、文書検索手段が出力する文書のうち、前記利用者によって入力された検索要求語句と関連度の高い関連語句を含む文書を、各関連語句毎にグループ化して表示する検索結果表示手段とを備えた文書検索装置を提案する。
【００１２】
ここで、検索結果表示手段として、検索要求語句の履歴（過去に入力され、使用された検索要求語句の集合）中から前記利用者によって入力された検索要求語句に隣接している語句を抽出し、その出現頻度を算出し、該頻度が高い語句を前記利用者によって入力された検索要求語句と関連度の高い関連語句とみなし、文書検索手段が出力する文書のうち、前記利用者によって入力された検索要求語句とこれらの関連語句との連接語句を含む文書を、各関連語句毎にグループ化して表示する検索結果表示手段を用いても良く、検索入力語句と隣接する語句とで構成されるフレーズのうち、検索要求語句の履歴中に高い頻度で現れるフレーズはその時点で多くの利用者が共通に求める傾向が高いフレーズであると考えることができるから、これらのフレーズを分類の基準とみなすことによって、利用者にとって重要度の高い概念によって検索結果を分類し、提示することが可能となる。
【００１３】
以上の通り、本発明によれば、検索応答時間が長くなってしまうという課題と、利用者にとっての語句の重要度の動的な変化に対応して的確な分類が行えないという課題の両方を解決することが可能となる。
【００１４】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して詳細に説明する。
【００１５】
図１は、本発明の文書検索装置の実施の形態の一例を示すもので、図中、１は検索要求入力処理部、２は文書検索処理部、３は検索結果表示処理部、４は予め蓄積した検索対象文書の集合（を記憶した記憶部）である。
【００１６】
検索要求入力処理部１は、利用者によって入力された語句を検索要求語句として受け取り、これを文書検索処理部２に引き渡す。
【００１７】
文書検索処理部２は、検索対象文書の集合４から前記検索要求語句を含む文書を検索し、これを検索結果文書として検索結果表示処理部３に出力する。
【００１８】
検索結果表示処理部３は、検索結果文書のうち、前記検索要求語句と関連度の高い関連語句を含む文書を、各関連語句毎にグループ化して表示する。
【００１９】
検索結果表示処理部３において、何をもって関連語句とするか（何から抽出し、どのようにして決定するか）は種々考えられるが、本発明では、検索要求語句に隣接して現れる語句を関連語句の候補として、検索要求語句の履歴から検索要求語句に隣接して現れる語句を抽出し、隣接して現れる回数が多い語句（出現頻度の高い語句）を関連語句とみなし、検索要求語句とこれらの関連語句との連接語句を含む文書を、各関連語句毎にグループ化して表示する。
【００２０】
また、何をもって隣接して現れる回数が多い（出現頻度が高い）とするかについても種々考えられるが、（１）全検索結果文書における出現割合が一定以上、（２）出現回数の多い順に上位何番目まで、（３）これらの組み合わせ等が考えられる。
【００２１】
なお、抽出対象とする隣接語句は検索要求語句（の品詞）によっても変わるが、通常、名詞（人名、地名等の固有名詞を含む）であり、場合によっては形容詞、副詞等を含めても良いが、接続詞、接頭語、接尾語等は対象外とする。
【００２２】
図２は、文書検索及び検索結果表示処理の一例（但し、特許請求の範囲には含まれない。）、ここでは検索結果文書中の検索要求語句の隣接語句を関連語句とする場合の例を示す流れ図である。
【００２３】
即ち、検索要求入力処理部１において利用者より入力された検索要求語句に基づき、文書検索処理部２において検索対象文書の集合４から該検索要求語句を含む文書を検索し、検索結果文書として出力した後（Ｓ１）、検索結果表示処理部３において検索結果文書中から検索要求語句に隣接している語句を抽出し、その出現頻度を算出し（Ｓ２）、該頻度が高い語句を関連語句と見なし、検索要求語句とこれらの関連語句との連接語句を含む文書を、各関連語句毎にグループ化し（Ｓ３）、出現頻度が高いものから順に表示する（Ｓ４）。
【００２４】
図３にこの方法による検索結果表示処理の具体例を示す。
【００２５】
図４は、文書検索及び検索結果表示処理の他の例、ここでは検索要求語句の履歴中の検索要求語句の隣接語句を関連語句とする場合の例を示す流れ図である。
【００２６】
即ち、検索要求入力処理部１において利用者より入力された検索要求語句に基づき、文書検索処理部２において検索対象文書の集合４から該検索要求語句を含む文書を検索し、検索結果文書として出力した後（Ｓ１）、検索結果表示処理部３において検索要求語句の履歴（過去に入力され、使用された検索要求語句の集合）中から検索要求語句に隣接している語句を抽出し、その出現頻度を算出し（Ｓ１１）、該頻度が高い語句を関連語句と見なし、検索要求語句とこれらの関連語句との連接語句を含む文書を、各関連語句毎にグループ化し（Ｓ３）、出現頻度が高いものから順に表示する（Ｓ４）。
【００２７】
図５にこの方法による検索結果表示処理の具体例を示す。
【００２８】
この際、対象とする検索要求語句の履歴の範囲を、検索要求語句が入力された時刻から遡って一定期間（例えば７２時間の範囲等）内に限定し、この中から該当する語句の出現頻度を求めることにより、流行等の要因による検索要求語句の重要度の変動により的確に追従することができる。
【００２９】
【発明の効果】
本発明の効果は、第１に、検索要求語句に対する関連語句を抽出し、それに関する分類を行うことにより、クラスタリング等の手法に比べ、より短時間に分類処理を行えることであり、第２に、検索要求語句の重要度が流行等を反映して変動する事象に追従でき、多くの利用者の関心に沿った語句によって分類が可能になることである。
【図面の簡単な説明】
【図１】本発明の文書検索装置の実施の形態の一例を示す構成図
【図２】文書検索及び検索結果表示処理の一例を示す流れ図
【図３】図２の処理の流れに従う検索結果表示処理の具体例を示す説明図
【図４】文書検索及び検索結果表示処理の他の例を示す流れ図
【図５】図４の処理の流れに従う検索結果表示処理の具体例を示す説明図
【符号の説明】
１：検索要求入力処理部、２：文書検索処理部、３：検索結果表示処理部、４：検索対象文書の集合。
[0001]
[Technical field of the invention]
The present invention relates to a document retrieval technique for accumulating a large amount of document information and retrieving and presenting a document including an inputted character string (phrase).
[0002]
[Prior art]
In recent years, with the rapid development of information distribution infrastructure, typified by the spread of the Internet, a large amount of document information (hereinafter simply "documents") has come into circulation. Systems have been realized that collect these documents exhaustively and search for and present documents whose body text contains a phrase input as a search request. In particular, services called Web search engines, which search documents on the WWW (World Wide Web), have been realized, such as goo (http://www.goo.ne.jp/) and Google (http://www.google.com/).
[0003]
In these Web search engines, it is not uncommon for a very large number of documents, from several thousand to several million, to be returned as search results for an input search request phrase. Since it is practically impossible for a user to look through all of such a large number of search result documents, these systems have helped users narrow the results down to the documents they want, either by selecting and presenting the documents considered to best match the search request phrase, or by analyzing the set of search result documents and displaying them grouped by site or by document subject.
[0004]
In natural language, a single phrase often has multiple meanings, that is, it is polysemous. Consequently, when a search is performed with a given phrase, documents that use the phrase in different senses may be mixed together in the search results. As a means of accurately narrowing these down to the documents the user wants, a technique that classifies the search results by document subject and presents them in groups is useful.
[0005]
One method used to classify search results extracts characteristic words (phrases) from each document using the TF × IDF method or the like, places the documents in a vector space with as many dimensions as there are such phrases, defines the similarity between documents as the cosine of the angle between their document vectors, and uses cluster analysis to group documents with high similarity.
[0006]
The details of the TF × IDF method are described in, for example, G. Salton, "Automatic Text Processing", 1989, Addison-Wesley Publishing. It is a technique for calculating the degree of match between a document and a phrase based on two principles: a phrase that appears less frequently across the entire document set is regarded as an important phrase, and a phrase that appears more frequently within a given document is regarded as a word that characterizes that document well.
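The conventional approach described in [0005] and [0006] can be illustrated with a minimal Python sketch. It assumes documents are already tokenized and uses one common weighting, tf × log(N / df); the patent does not specify the exact formula, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        # Assumed weighting: rare terms across the set score higher (IDF),
        # frequent terms within the document score higher (TF).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample: "jaguar" is polysemous (car vs. animal).
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "car", "dealer"],
    ["jaguar", "cat", "jungle"],
]
vecs = tf_idf_vectors(docs)
sim_cars = cosine(vecs[0], vecs[1])   # both about cars
sim_mixed = cosine(vecs[0], vecs[2])  # car vs. animal
```

Because "jaguar" occurs in every document, its weight is zero here, so similarity is driven by the remaining terms. A cluster analysis would then need such pairwise similarities across the whole result set, which is the cost the invention seeks to avoid.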
[0007]
[Problems to be solved by the invention]
However, when the number of search result documents is large, if the above cluster analysis is performed on all documents, there is a problem that the search response time becomes long.
[0008]
Also, when the feature values of each document are calculated using the TF × IDF method, the importance of each phrase is treated as static and unchanging. In practice, however, a very common phrase often comes to be regarded by users as highly important when, for example, it is used as the title of a book or movie, and a very ordinary combination of a family name and a given name often comes to be recognized as the name of a celebrity and becomes a phrase of high importance to users. Despite this, classification could not accurately reflect such changes.
[0009]
An object of the present invention is to shorten the search response time and enable accurate classification corresponding to dynamic changes in the importance of phrases for users.
[0010]
[Means for Solving the Problems]
In order to solve the above problems, the present invention proposes a document search apparatus that searches for and presents documents containing a desired phrase from a set of search target documents stored in advance, the apparatus comprising: search request input means for receiving a phrase input by a user as a search request phrase; document search means for searching the set of search target documents for documents containing the search request phrase input by the user and outputting them; and search result display means for displaying, among the documents output by the document search means, documents containing related phrases that are highly related to the search request phrase input by the user, grouped by related phrase.
[0012]
Here, the search result display means may be one that extracts, from the history of search request phrases (the set of search request phrases input and used in the past), phrases adjacent to the search request phrase input by the user, calculates their appearance frequencies, regards phrases with high frequency as related phrases that are highly related to the search request phrase input by the user, and displays, among the documents output by the document search means, documents containing the concatenated phrases of the search request phrase input by the user and these related phrases, grouped by related phrase. Among the phrases formed by the search input phrase and an adjacent phrase, those that appear with high frequency in the history of search request phrases can be considered phrases that many users tend to seek in common at that time. By treating these phrases as the classification criteria, search results can therefore be classified and presented according to concepts of high importance to users.
[0013]
As described above, the present invention makes it possible to solve both the problem that the search response time becomes long and the problem that accurate classification cannot be performed in response to dynamic changes in the importance of phrases to users.
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described in detail with reference to the drawings.
[0015]
FIG. 1 shows an example of an embodiment of a document search apparatus according to the present invention. In the figure, 1 is a search request input processing unit, 2 is a document search processing unit, 3 is a search result display processing unit, and 4 is a set of search target documents stored in advance (a storage unit that stores them).
[0016]
The search request input processing unit 1 receives a phrase input by the user as a search request phrase and passes it to the document search processing unit 2.
[0017]
The document search processing unit 2 searches the set 4 of search target documents for documents containing the search request phrase and outputs them as search result documents to the search result display processing unit 3.
[0018]
The search result display processing unit 3 displays, among the search result documents, documents containing related phrases that are highly related to the search request phrase, grouped by related phrase.
[0019]
In the search result display processing unit 3, there are various conceivable choices of what to treat as a related phrase (what to extract it from and how to determine it). In the present invention, phrases that appear adjacent to the search request phrase are treated as candidates for related phrases: phrases appearing adjacent to the search request phrase are extracted from the history of search request phrases, phrases that appear adjacent many times (phrases with a high appearance frequency) are regarded as related phrases, and documents containing the concatenated phrases of the search request phrase and these related phrases are displayed grouped by related phrase.
[0020]
There are also various conceivable criteria for deciding that the number of adjacent appearances is large (that the appearance frequency is high), such as (1) an appearance ratio across all search result documents at or above a certain level, (2) the top N phrases in descending order of number of appearances, or (3) a combination of these.
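The three selection criteria in [0020] can be sketched as a single function. This is a minimal illustration only: the function name, the parameter names `min_ratio` and `top_k`, and the sample counts are assumptions, since the patent gives no concrete thresholds.

```python
from collections import Counter

def select_related(counts, total_docs, min_ratio=None, top_k=None):
    """Pick related phrases from adjacent-phrase counts.

    counts:     Counter mapping candidate phrase -> number of appearances
    total_docs: number of search result documents (denominator for the ratio)
    min_ratio:  criterion (1), keep phrases whose ratio is at least this value
    top_k:      criterion (2), keep only the k most frequent phrases
    Passing both applies criterion (3), the combination.
    """
    if total_docs == 0:
        return []
    ranked = counts.most_common()  # descending by count
    if min_ratio is not None:
        ranked = [(p, c) for p, c in ranked if c / total_docs >= min_ratio]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [p for p, _ in ranked]

# Hypothetical counts of phrases adjacent to some search request phrase.
counts = Counter({"movie": 50, "review": 30, "cast": 15, "price": 5})
by_ratio = select_related(counts, 100, min_ratio=0.2)
by_rank = select_related(counts, 100, top_k=3)
combined = select_related(counts, 100, min_ratio=0.1, top_k=2)
```

Keeping both thresholds optional lets the same routine cover all three variants without separate code paths.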
[0021]
The adjacent phrases to be extracted also vary with the search request phrase (its part of speech), but they are usually nouns (including proper nouns such as personal names and place names). Adjectives, adverbs, and the like may be included in some cases, but conjunctions, prefixes, suffixes, and the like are excluded.
[0022]
FIG. 2 is a flowchart showing an example of document search and search result display processing (which, however, is not included in the scope of the claims), here an example in which phrases adjacent to the search request phrase in the search result documents are used as related phrases.
[0023]
That is, based on the search request phrase input by the user at the search request input processing unit 1, the document search processing unit 2 searches the set 4 of search target documents for documents containing the search request phrase and outputs them as search result documents (S1). The search result display processing unit 3 then extracts phrases adjacent to the search request phrase from the search result documents and calculates their appearance frequencies (S2), regards phrases with high frequency as related phrases, groups documents containing the concatenated phrases of the search request phrase and these related phrases by related phrase (S3), and displays the groups in descending order of appearance frequency (S4).
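Steps S2 to S4 of this flow can be sketched in Python as follows. The sketch assumes the search result documents (the output of S1) are already tokenized into word lists, counts each concatenated phrase once per document, and omits the part-of-speech filtering mentioned in [0021]; all names and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def concatenated_phrases(tokens, query):
    """Return the set of two-token phrases formed by `query` and a direct neighbour."""
    phrases = set()
    for i, tok in enumerate(tokens):
        if tok != query:
            continue
        if i > 0:
            phrases.add(f"{tokens[i - 1]} {query}")
        if i + 1 < len(tokens):
            phrases.add(f"{query} {tokens[i + 1]}")
    return phrases

def group_results(result_docs, query, top_k=3):
    """S2: count adjacent phrases; S3: group documents; S4: order by frequency."""
    counts = Counter()
    groups = defaultdict(list)
    for doc_id, tokens in result_docs.items():
        for phrase in concatenated_phrases(tokens, query):
            counts[phrase] += 1            # counted once per document (assumption)
            groups[phrase].append(doc_id)
    return [(p, groups[p]) for p, _ in counts.most_common(top_k)]

# Hypothetical search result documents for the query "titanic".
result_docs = {
    "d1": ["titanic", "movie", "review"],
    "d2": ["the", "titanic", "movie", "cast"],
    "d3": ["titanic", "wreck", "photos"],
    "d4": ["titanic", "movie", "tickets"],
}
top_group = group_results(result_docs, "titanic", top_k=1)
```

A single pass over the result documents suffices here, in contrast to the pairwise comparisons that cluster analysis requires.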
[0024]
FIG. 3 shows a specific example of search result display processing by this method.
[0025]
FIG. 4 is a flowchart showing another example of document search and search result display processing, here an example in which phrases adjacent to the search request phrase in the history of search request phrases are used as related phrases.
[0026]
That is, based on the search request phrase input by the user at the search request input processing unit 1, the document search processing unit 2 searches the set 4 of search target documents for documents containing the search request phrase and outputs them as search result documents (S1). The search result display processing unit 3 then extracts phrases adjacent to the search request phrase from the history of search request phrases (the set of search request phrases input and used in the past) and calculates their appearance frequencies (S11), regards phrases with high frequency as related phrases, groups documents containing the concatenated phrases of the search request phrase and these related phrases by related phrase (S3), and displays the groups in descending order of appearance frequency (S4).
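The history-based variant (S11, then S3) might be sketched as below. It assumes past search requests are stored as token lists and represents each concatenated phrase as a pair of adjacent tokens; the function names, `top_k`, and the sample data are illustrative assumptions, not details from the patent.

```python
from collections import Counter

def related_from_history(history, query, top_k=2):
    """S11: count token pairs in past queries where `query` has a direct neighbour."""
    counts = Counter()
    for past in history:
        for i, tok in enumerate(past):
            if tok != query:
                continue
            if i > 0:
                counts[(past[i - 1], query)] += 1
            if i + 1 < len(past):
                counts[(query, past[i + 1])] += 1
    return [pair for pair, _ in counts.most_common(top_k)]

def contains_pair(tokens, pair):
    """True if the two tokens of `pair` occur consecutively in `tokens`."""
    return any((a, b) == pair for a, b in zip(tokens, tokens[1:]))

def group_by_history(result_docs, pairs):
    """S3: for each concatenated phrase, collect the result documents containing it."""
    return {
        pair: [doc_id for doc_id, toks in result_docs.items() if contains_pair(toks, pair)]
        for pair in pairs
    }

# Hypothetical query history and search result documents.
history = [
    ["titanic", "movie"],
    ["titanic", "movie"],
    ["titanic", "wreck"],
    ["cheap", "flights"],
]
result_docs = {
    "d1": ["titanic", "movie", "review"],
    "d2": ["the", "titanic", "movie", "cast"],
    "d3": ["titanic", "wreck", "photos"],
}
pairs = related_from_history(history, "titanic")
groups = group_by_history(result_docs, pairs)
```

Since the related phrases come from the query log rather than from the result documents, the counting step does not depend on the size of the result set.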
[0027]
FIG. 5 shows a specific example of search result display processing by this method.
[0028]
In this case, by limiting the range of the search request phrase history under consideration to a fixed period going back from the time the search request phrase was input (for example, a range of 72 hours) and obtaining the appearance frequency of the relevant phrases from within this range, changes in the importance of search request phrases due to factors such as trends can be tracked more accurately.
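Restricting the history to a recent window could look like the following sketch. It assumes the query log stores a timestamp with each past search request; the 72-hour default comes from the example in the text, while the function name and log layout are hypothetical.

```python
from datetime import datetime, timedelta

def recent_history(log, now, window=timedelta(hours=72)):
    """Keep only past queries issued within `window` before `now`.

    log: list of (timestamp, query_tokens) tuples (assumed storage format)
    """
    cutoff = now - window
    return [tokens for ts, tokens in log if cutoff <= ts <= now]

# Hypothetical log relative to the time the current query was input.
now = datetime(2002, 1, 11, 12, 0)
log = [
    (now - timedelta(hours=1), ["titanic", "movie"]),
    (now - timedelta(hours=71), ["titanic", "wreck"]),
    (now - timedelta(hours=73), ["titanic", "history"]),  # outside the window
]
recent = recent_history(log, now)
```

The filtered list can then be passed to whatever routine counts adjacent phrases, so that older queries no longer influence which phrases are treated as related.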
[0029]
[Effects of the invention]
The effects of the present invention are, first, that by extracting related phrases for the search request phrase and classifying according to them, classification can be performed in a shorter time than with techniques such as clustering, and second, that it can track changes in the importance of search request phrases that reflect trends and the like, enabling classification by phrases in line with the interests of many users.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an example of an embodiment of the document search apparatus of the present invention.
FIG. 2 is a flowchart showing an example of document search and search result display processing.
FIG. 3 is an explanatory diagram showing a specific example of search result display processing according to the processing flow of FIG. 2.
FIG. 4 is a flowchart showing another example of document search and search result display processing.
FIG. 5 is an explanatory diagram showing a specific example of search result display processing according to the processing flow of FIG. 4.
[Explanation of symbols]
1: search request input processing unit, 2: document search processing unit, 3: search result display processing unit, 4: set of search target documents.

Claims

A document search apparatus for searching and presenting a document including a desired phrase from a set of search target documents stored in advance,
A search request input means for receiving a phrase input by a user as a search request phrase;
A document search means for searching and outputting a document including a search request phrase input by the user from a set of search target documents;
A search result display means that extracts, from a history of search request phrases that is a set of search request phrases input and used in the past, phrases adjacent to the search request phrase input by the user, calculates their appearance frequencies, regards phrases with high frequency as related phrases highly related to the search request phrase input by the user, and displays, among the documents output by the document search means, documents containing concatenated phrases of the search request phrase input by the user and these related phrases, grouped by related phrase.

A document search program for searching and presenting a document including a desired phrase from a set of search target documents stored in advance,
The program causes a computer to execute:
A search request input step for receiving a phrase input by a user as a search request phrase;
A document search step of searching and outputting a document including a search request word input by the user from a set of search target documents;
A search result display step of extracting, from a history of search request phrases that is a set of search request phrases input and used in the past, phrases adjacent to the search request phrase input by the user, calculating their appearance frequencies, regarding phrases with high frequency as related phrases highly related to the search request phrase input by the user, and displaying, among the searched and output documents, documents containing concatenated phrases of the search request phrase input by the user and these related phrases, grouped by related phrase.

A computer-readable medium having the document search program according to claim 2 recorded thereon.