JP2006139484A

JP2006139484A - Information retrieval method, system therefor and computer program

Info

Publication number: JP2006139484A
Application number: JP2004327849A
Authority: JP
Inventors: Takaaki Hasegawa (長谷川 隆明); Hisashi Obara (小原 永)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-11-11
Filing date: 2004-11-11
Publication date: 2006-06-01
Anticipated expiration: 2024-11-11
Also published as: JP4428703B2

Abstract

PROBLEM TO BE SOLVED: To provide an information retrieval method, a system therefor, and a computer program that allow co-occurrence expressions contained in desired documents to be retrieved in a bird's-eye view by keyword search, without classifying documents and without preparing a dictionary or patterns in advance.

SOLUTION: Documents in a document set database are searched using a query word as the search keyword, and words appearing near the query word in those documents are extracted as co-occurrence words. The document set database is then searched to obtain the number of documents retrieved by the query word alone, the number retrieved by the co-occurrence word alone, and the number retrieved by both the query word and the co-occurrence word, and the degree of relevance between the query word and the co-occurrence word is calculated from these document counts. Co-occurrence words whose relevance falls within a prescribed range are treated as important co-occurrence words; phrases and sentences containing both the query word and an important co-occurrence word are extracted from the document set database and presented as co-occurrence expressions, grouped by important co-occurrence word.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an information retrieval method and system in which, for a large accumulated collection of documents, a user inputs a specific keyword, co-occurrence words related to the keyword are extracted, and co-occurrence expressions containing those co-occurrence words are retrieved.

With the development of the Internet, opportunities for individuals to publish information have increased greatly, and users can now use the Internet to look up many other people's opinions on a specific topic. However, when a user investigates a specific topic with a search engine, it is difficult to anticipate which keywords will be effective for that topic, so the number of keywords typically entered into a search engine is at most two or three, and sometimes only one. Research and development has therefore been carried out on guiding users toward a clearer search intent, even when few keywords are entered, by classifying or clustering the search results before presenting them [Non-patent Document 1].

In general, document clustering is realized by assigning weights to the words in each document so that the document is represented as a vector, and then grouping documents whose vectors are similar [Non-patent Document 2].

Many of the documents that individuals publish are in diary form, in which a date followed by personal opinions and impressions is repeated. Because individual interests are diverse and change from day to day, a single document may end up mixing miscellaneous topics. Document-level classification therefore has difficulty classifying by topic. In addition, the user has to open each document in a classified set to check its contents, which is inefficient both for pinpoint access to individual opinions and for grasping the overall trend of opinion.

On the other hand, when the content of the topic to be extracted is clearly determined in advance, a method has been proposed in which dictionaries and patterns related to the topic are created beforehand and the parts of the document set that match them are extracted, as described in [Non-patent Document 3]. However, this approach is difficult to apply when the content to be extracted is not known in advance or when the topic changes dynamically.

[Non-patent Document 1] "Research on automatic summarization and interfaces for intelligent presentation of information retrieval results", http://www.forest.eis.ynu.ac.jp/〜mori/Kaken/Informatics/

[Non-patent Document 2] Makoto Nagao (ed.), Iwanami Lecture Series on Software Science 15: Natural Language Processing, Chapter 11

[Non-patent Document 3] Tateishi et al., "Opinion information extraction from Web document collections and viewpoint-based summary generation", Proceedings of the 10th Annual Meeting of the Association for Natural Language Processing (March 2004)

As described above, classifying retrieved documents by topic is difficult because the documents contain miscellaneous topics, and because the contents of the classified documents must be checked one by one, document classification is an inefficient way to get an overview of a topic.

Creating dictionaries and patterns for each topic in advance, on the other hand, involves problems of time and cost.

To solve these problems, an object of the present invention is to provide an information retrieval method, a system therefor, and a computer program that make it possible to retrieve, in a bird's-eye view, co-occurrence expressions contained in desired documents by searching with a search keyword, without classifying documents and without creating dictionaries or patterns in advance.

To achieve the above object, the present invention proposes an information retrieval method that uses a computer device to present co-occurrence expressions related to a query word, given as a search keyword, from a document set database storing a plurality of documents, wherein the computer device executes: a step of searching the document set database with the query word and extracting, as co-occurrence words, words that appear near the query word in the retrieved documents; a step of obtaining, from the documents in the document set database, the number of documents retrieved by the query word alone, the number retrieved by the co-occurrence word alone, and the number retrieved by both the query word and the co-occurrence word; a step of calculating the degree of relevance between the query word and the co-occurrence word from the obtained document counts; a step of collecting, from the documents in the document set database, co-occurrence expressions, each being at least one of a sentence and a phrase that contains both the query word and a co-occurrence word whose calculated relevance satisfies a predetermined condition; and a step of presenting the co-occurrence expressions.

According to the information retrieval method of the present invention, the document set database is searched with the query word, and words that appear near the query word in the retrieved documents are extracted as co-occurrence words. The document set database is also searched to obtain the number of documents matching the query word alone, the number matching the co-occurrence word alone, and the number containing both the query word and the co-occurrence word, and the relevance between the query word and the co-occurrence word is calculated from these counts. Then, on the basis of the calculated relevance, co-occurrence words strongly related to the query word are taken as important co-occurrence words; at least one of sentences and phrases containing both the query word and an important co-occurrence word is extracted from the documents in the document set database as co-occurrence expressions, and these co-occurrence expressions are presented.

As a result, for example, co-occurrence words related to a topic can be extracted, on the basis of matching-document counts, from the results of searching with a small number of keywords about a topic that appears in documents containing opinion information; co-occurrence expressions (phrases and sentences containing the co-occurrence words) can be collected from the set of opinion-bearing documents; and the co-occurrence expressions, grouped by co-occurrence word, can be presented to the user. The user can thus search opinion information on a desired topic in a bird's-eye view.

To make the above method practicable, the present invention also provides an information retrieval system that presents co-occurrence expressions related to a query word, given as a search keyword, from a document set database storing a plurality of documents. The system comprises: a document set database storing a plurality of documents; a search document database that stores documents retrieved from the document set database with an input word; a co-occurrence word list that stores co-occurrence words that co-occur with the query word in the search documents; a matching document count table that stores the number of matching documents obtained when the document set database is searched with the query word, with a co-occurrence word, and with the pair of the query word and a co-occurrence word; an important co-occurrence word list that stores the query word together with important co-occurrence words, that is, co-occurrence words satisfying a predetermined condition; a co-occurrence expression database that stores at least one of sentences and phrases containing both the query word and a co-occurrence word; a word input unit for inputting a word; a matching document search unit that takes the word input through the word input unit as the query word, searches the document set database for documents containing the query word to obtain search documents, and stores the search documents in the search document database in association with the query word; a co-occurrence word acquisition unit that divides each document stored in the search document database into words, takes words appearing near the query word as co-occurrence words, and registers them in the co-occurrence word list in association with the query word; a matching document count acquisition unit that stores, in the matching document count table, the number of matching documents obtained when the document set database is searched with the query word, with each co-occurrence word registered in the co-occurrence word list, and with each pair of the query word and a corresponding co-occurrence word; a relevance calculation unit that refers to the matching document count table and calculates the relevance between the query word and each co-occurrence word; an important co-occurrence word storage unit that takes co-occurrence words whose relevance to the query word satisfies the predetermined condition as important co-occurrence words and stores them in the important co-occurrence word list in association with the query word; a co-occurrence expression collection unit that searches the document set database with the query word and each important co-occurrence word stored in the important co-occurrence word list, obtains documents containing these words, extracts from those documents at least one of sentences and phrases containing both the query word and the important co-occurrence word, and stores them in the co-occurrence expression database as co-occurrence expressions; and an output unit that outputs and displays, for each important co-occurrence word, the co-occurrence expressions stored by the co-occurrence expression collection unit.

According to the information retrieval method and system of the present invention, for a topic that a user wants to investigate in a document set containing opinion information, co-occurrence words that are moderately related to the topic and not too general are obtained on the basis of matching-document counts; sentences and phrases containing both the topic word and a co-occurrence word are collected exhaustively as co-occurrence expressions; and these are presented to the user for each co-occurrence word. The user can therefore get an overview of the keywords that serve as axes of the topic and of the opinion information containing them.

FIG. 1 is a configuration diagram showing an information retrieval system according to an embodiment of the present invention. In the figure, reference numeral 100 denotes the information retrieval system, which comprises a word input unit 200, a matching document search unit 300, a document set database 400, a search document database 500, a co-occurrence word acquisition unit 600, a co-occurrence word list 700, a matching document count acquisition unit 800, a matching document count table 900, a relevance calculation unit 1000, an important co-occurrence word storage unit 1100, an important co-occurrence word list 1200, a co-occurrence expression collection unit 1300, a co-occurrence expression database 1400, and an output unit 1500, and is implemented on at least one well-known computer device.

The word input unit 200 accepts a word given from outside as a query (hereinafter simply called the query word).

The matching document search unit 300 searches the document set database 400 for documents that match the input query word, that is, documents containing the query word, acquires up to a predetermined number of them as search documents, and stores these search documents in the search document database 500 paired with the query word. A large number of documents is stored in the document set database 400 in advance.

The search document database 500 stores the search documents acquired by the matching document search unit 300, paired with the query word.

The co-occurrence word acquisition unit 600 processes the search documents stored in the search document database 500: it removes tags from any document that has them, divides each search document into words, extracts words that appear near the query word as co-occurrence words, and registers the extracted co-occurrence words in the co-occurrence word list 700 in association with the query word.

The co-occurrence word list 700 stores the co-occurrence words that co-occur with the query word in the search documents, in association with the query word.

The matching document count acquisition unit 800 stores in the matching document count table 900 the number of matching documents obtained when the document set database 400 is searched with each co-occurrence word stored in the co-occurrence word list 700, and the number obtained when it is searched with each pair of the query word and a co-occurrence word.

Accordingly, the matching document count table 900 holds the number of matching documents obtained when the document set database 400 is searched with a co-occurrence word, and the number obtained when it is searched with the pair of the query word and the corresponding co-occurrence word.

The relevance calculation unit 1000 refers to the matching document count table 900 and calculates the relevance between the query word and each co-occurrence word. The calculation method is described later.

The important co-occurrence word storage unit 1100 takes co-occurrence words whose relevance to the query word satisfies a predetermined condition as important co-occurrence words and stores them in the important co-occurrence word list 1200 in association with the query word.

Accordingly, the important co-occurrence word list 1200 holds the query word together with the important co-occurrence words, that is, the co-occurrence words that satisfy the predetermined condition.

The co-occurrence expression collection unit 1300 searches the document set database 400 with the query word and each important co-occurrence word stored in the important co-occurrence word list 1200, obtains the documents that match the search, and, taking only a predetermined number of those documents as targets, exhaustively collects from them the sentences and phrases that contain both the query word and the important co-occurrence word. It stores these sentences and phrases in the co-occurrence expression database 1400 as co-occurrence expressions.

Accordingly, the co-occurrence expression database 1400 holds sentences and phrases that contain both the query word and an important co-occurrence word.

The output unit 1500 outputs and displays, for each important co-occurrence word, the co-occurrence expressions collected by the co-occurrence expression collection unit 1300 and stored in the co-occurrence expression database 1400.

Next, the computer program processing of the information retrieval system configured as above is described with reference to the flowchart shown in FIG. 2.

When a word is input through the word input unit 200 (S1), the information retrieval system 100 takes the input word as the query word and searches the document set database 400 for documents matching it (S2). It obtains the number of documents that matched the query word and stores it in the matching document count table 900 in association with the query word (S3), acquires search documents up to a predetermined number (S4), and stores the acquired search documents in the search document database 500 in association with the query word (S5).
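Steps S2 to S5 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the in-memory dictionary `doc_db`, the function name `search`, substring matching, and the use of insertion order in place of a ranking are all assumptions, and "DigiCam1" is a hypothetical stand-in for the product name.

```python
def search(doc_db, query, limit):
    """Return (number of matching documents, up to `limit` matching documents).

    Assumption: a document matches when its text contains the query string.
    A real system would use an index and a ranking function instead.
    """
    hits = [(doc_id, text) for doc_id, text in doc_db.items() if query in text]
    return len(hits), hits[:limit]


# Hypothetical stand-in for the document set database 400.
doc_db = {
    "d1": "The DigiCam1 has a long battery life.",
    "d2": "Startup of the DigiCam1 is fast.",
    "d3": "I bought a new tripod.",
}

# S2-S3: search and count; S4-S5: keep at most `limit` search documents.
count, retrieved = search(doc_db, "DigiCam1", limit=1)
```

The count is taken over all matches, while only the first `limit` documents are kept for later word extraction, mirroring the separation between S3 and S4.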

Next, for the documents stored in the search document database 500, tags are removed from any document that has them, each document is divided into words (S6), words that appear near the query word are extracted as co-occurrence words (S7), and the extracted co-occurrence words are registered in the co-occurrence word list 700 in association with the query word (S8).
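A rough sketch of S6 to S8 is shown below. The regular-expression tokenizer is only a stand-in for word segmentation of Japanese text, and the fixed token window used to define "near the query word" is an assumption; the function names and the sample sentence are hypothetical.

```python
import re


def strip_tags(text):
    # Remove anything that looks like an HTML/XML tag.
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    # Stand-in for word segmentation: lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cooccurring_words(documents, query, window=3):
    """Collect words within `window` tokens of each occurrence of `query`."""
    q = query.lower()
    found = set()
    for text in documents:
        tokens = tokenize(strip_tags(text))
        for i, tok in enumerate(tokens):
            if tok == q:
                nearby = tokens[max(0, i - window): i + window + 1]
                found.update(t for t in nearby if t != q)
    return found


docs = ["<p>The DigiCam1 battery lasts all day.</p>"]
words = cooccurring_words(docs, "DigiCam1", window=2)
```

With a window of 2, only "the", "battery", and "lasts" are picked up from the sample sentence; a part-of-speech filter would be needed to drop function words such as "the".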

After that, the number of matching documents obtained when the document set database 400 is searched with each co-occurrence word stored in the co-occurrence word list 700, and the number obtained when it is searched with each pair of the query word and a corresponding co-occurrence word, are stored in the matching document count table 900 (S9).

Next, the matching document count table 900 is referred to, and the relevance between the query word and the co-occurrence word is calculated for each co-occurrence word (S10). A specific example of the calculation method is given later.

The information retrieval system 100 then takes the co-occurrence words whose calculated relevance to the query word satisfies a predetermined condition as important co-occurrence words (S11) and stores them in the important co-occurrence word list 1200 in association with the query word (S12).

Next, the document set database 400 is searched with the query word and each important co-occurrence word stored in the important co-occurrence word list 1200 (S13), and the documents matching these words, that is, the documents containing them, are extracted (S14). Taking only a predetermined number of those documents as targets, the sentences and phrases that contain both the query word and the important co-occurrence word are exhaustively collected from them as co-occurrence expressions (S15), and these co-occurrence expressions are stored in the co-occurrence expression database 1400 (S16).
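Steps S13 to S16 might look like the following sketch. It assumes substring matching, a naive sentence splitter based on end-of-sentence punctuation, and hypothetical names (`collect_expressions`, `max_docs`); phrase-level extraction is not attempted here.

```python
import re


def collect_expressions(documents, query, important_words, max_docs=100):
    """Map each important co-occurrence word to sentences containing it and the query."""
    expressions = {w: [] for w in important_words}
    for w in important_words:
        # S13-S14: documents containing both words, limited to `max_docs`.
        matching = [t for t in documents if query in t and w in t][:max_docs]
        for text in matching:
            # S15: keep only sentences in which both words appear together.
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if query in sentence and w in sentence:
                    expressions[w].append(sentence.strip())
    return expressions


docs = [
    "DigiCam1 startup is quick. The strap is ugly.",
    "I like the DigiCam1. Its startup could be faster.",
]
result = collect_expressions(docs, "DigiCam1", ["startup"])
```

The second document matches the AND search but contributes nothing, because no single sentence in it contains both words; this is the difference between document-level matching and sentence-level collection.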

After that, the information retrieval system 100 outputs and displays the co-occurrence expressions stored in the co-occurrence expression database 1400 for each important co-occurrence word (S17).
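The grouping in S17 can be illustrated with a small formatter. The text layout, the function name `format_by_word`, and the sample expressions are assumptions made for illustration; the source does not prescribe a display format.

```python
def format_by_word(query, expressions):
    """Render co-occurrence expressions grouped under each important co-occurrence word."""
    lines = [f"Query word: {query}"]
    for word, items in expressions.items():
        lines.append(f"[{word}]")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


# Hypothetical contents of the co-occurrence expression database 1400.
expressions = {
    "startup": ["DigiCam1 startup is quick."],
    "battery": ["The DigiCam1 battery drains fast."],
}
report = format_by_word("DigiCam1", expressions)
print(report)
```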

The operation of the information retrieval system 100 of this embodiment is described below using a specific example, with reference to FIGS. 1 to 6.

Suppose, for example, that the product name "Digital Camera 1" is input to the word input unit 200. The matching document search unit 300 takes "Digital Camera 1" as the query word and searches the document set database 400 to extract the documents that match it. The format of the document set database 400 is not particularly restricted; it may hold an index such as the one described in [Non-patent Document 2].

If the preset number of documents is, say, 100, the matching document search unit 300 stores the top 100 documents in search-result ranking order in the search document database 500. In this example, each entry of the search document database 500 consists of a document ID and the document text. The document ID may be a URL or the like.

For the 100 documents stored in the search document database 500, the co-occurrence word acquisition unit 600 extracts the words that appear in the paragraphs containing "Digital Camera 1", or in the entire document, and stores them in the co-occurrence word list 700. Morphological analysis may be applied at this point, for example to restrict the words to nouns, to include compound words formed from consecutive nouns, or to take verbs and adjectives as well.

FIG. 3 shows an example of the co-occurrence word list 700. As co-occurrence words corresponding to the query word "Digital Camera 1", it stores, for example, "startup", "battery", "compact", "lightweight", "image quality", "response", "flash", "shutter", "Maker 1", "product", and "Digital Camera 2".

For each co-occurrence word stored in the co-occurrence word list 700, the matching document count acquisition unit 800 uses the word as a keyword and obtains the number of matching documents from the document set database 400. It also combines the query word stored in the co-occurrence word list 700 with each co-occurrence word and, using the combination as the keyword, obtains the number of matching documents from the document set database 400, that is, the number of documents containing both words. For example, it obtains the number of documents matching "Digital Camera 1 startup" and "Digital Camera 1 battery". These document counts are then stored in the matching document count table 900.
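The counting described above can be sketched as a small table builder. Substring matching over an in-memory dictionary is an assumption, as are the names `count_matching` and `build_count_table` and the three sample documents.

```python
def count_matching(doc_db, *keywords):
    """Number of documents containing every keyword (AND search)."""
    return sum(all(k in text for k in keywords) for text in doc_db.values())


def build_count_table(doc_db, query, cooccurrence_words):
    """Map each co-occurrence word w to (count for w alone, count for query AND w)."""
    return {
        w: (count_matching(doc_db, w), count_matching(doc_db, query, w))
        for w in cooccurrence_words
    }


# Hypothetical stand-in for the document set database 400.
doc_db = {
    "d1": "DigiCam1 startup is fast",
    "d2": "DigiCam1 battery is weak",
    "d3": "startup of my laptop is slow",
}
table = build_count_table(doc_db, "DigiCam1", ["startup", "battery"])
```

Only counts are needed at this stage, so against a real search engine each table cell would cost one query that returns a hit count rather than the documents themselves.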

FIG. 4 shows an example of the matching document count table 900. It indicates that 1,230,000 documents match the co-occurrence word "startup" alone, and that 1,920 documents match "Digital Camera 1 startup", the AND of the query word and the co-occurrence word.

The relevance calculation unit 1000 refers to the matching document count table 900, calculates the relevance R(q, w) between the query word q and the co-occurrence word w by the following formula, and updates the relevance entries in the matching document count table 900.

R(q, w) = log( H(q, w) / ( H(q) * H(w) ) )

Here, H(q) is the number of documents in the document set database 400 that match the query word q, H(w) is the number of documents that match the co-occurrence word w, and H(q, w) is the number of documents that match the AND search of the query word q and the co-occurrence word w. The symbol "*" in the formula denotes multiplication.

As the formula shows, the relevance R(q, w) is the logarithm of the value obtained by dividing H(q, w), the number of documents containing both the query word q and the co-occurrence word w, by the product of H(q), the number of documents containing q, and H(w), the number of documents containing w.
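The calculation described above can be sketched in a few lines. The base of the logarithm is not stated in this section, so the natural logarithm is assumed here; the value 34,000 for H(q) is hypothetical (H(q) is not given in the text) and was chosen only so that the counts quoted for "startup" in FIG. 4 yield a value near the relevance quoted for that word.

```python
import math


def relevance(h_q, h_w, h_qw):
    """R(q, w) = log(H(q, w) / (H(q) * H(w))), using the natural log (an assumption)."""
    if h_qw == 0:
        # No document contains both words: treat as minimally related.
        return float("-inf")
    return math.log(h_qw / (h_q * h_w))


# H(w) and H(q, w) are the FIG. 4 counts for "startup"; H(q) is hypothetical.
r_startup = relevance(h_q=34_000, h_w=1_230_000, h_qw=1_920)
```

Because the denominator is a product of two raw document counts, the value is strongly negative for any large collection, which is why the thresholds in the example are negative numbers.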

The intent of this formula is to compute a quantity corresponding to the mutual information between the query word and the co-occurrence word in the document set, so a formula similar to that of mutual information is adopted. The difference is that mutual information is calculated from the occurrence frequencies of the two words in the document set; because obtaining occurrence frequencies directly is inefficient when the document set is large, the numbers of documents matching the two words are used instead.
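One way to make this relationship explicit is the following short derivation. It rests on assumptions not stated in the text: N denotes the total number of documents in the document set database, probabilities are estimated as document fractions H/N, and "mutual information" is read as pointwise mutual information (PMI).

```latex
\begin{align*}
\mathrm{PMI}(q, w)
  &= \log \frac{P(q, w)}{P(q)\,P(w)}
   = \log \frac{H(q, w)/N}{\bigl(H(q)/N\bigr)\bigl(H(w)/N\bigr)}
   = \log \frac{H(q, w)}{H(q)\,H(w)} + \log N \\
R(q, w) &= \mathrm{PMI}(q, w) - \log N
\end{align*}
```

Under this reading, R(q, w) differs from the document-level PMI only by the constant log N, so both order co-occurrence words identically for a fixed collection, and N never has to be known to compute R.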

The value of this formula is used to identify co-occurrence words that are too general or too strongly related. The value becomes small when the co-occurrence word is too general and large when the co-occurrence word is strongly related to the query word. Because both kinds of co-occurrence word must be excluded, the important co-occurrence word storage unit 1100 stores in the important co-occurrence word list 1200, as important co-occurrence words, only those co-occurrence words whose relevance R(q, w) lies within a predetermined range. For example, the threshold range is set to -17 to -15, the matching document count table 900 is consulted, and the co-occurrence words whose relevance falls within this range are stored in the important co-occurrence word list 1200 as important co-occurrence words. In FIG. 4, for example, the co-occurrence words "startup", with a relevance of -16.9, and "battery", with a relevance of -15.1, are stored in the important co-occurrence word list 1200 as important co-occurrence words.
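The relevance calculation and the range filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of the natural logarithm, the inclusive range bounds, and all document counts in `example_counts` are assumptions made for the example.

```python
import math


def relevance(h_q, h_w, h_qw):
    """Return log(H(q, w) / (H(q) * H(w))).

    The natural logarithm is an assumption; the text does not state the base.
    A pair with any zero count is treated as unrelated (-inf).
    """
    if h_q <= 0 or h_w <= 0 or h_qw <= 0:
        return float("-inf")
    return math.log(h_qw / (h_q * h_w))


def select_important(counts, h_q, low=-17.0, high=-15.0):
    """Keep co-occurrence words whose relevance lies within [low, high].

    counts maps each co-occurrence word to (H(w), H(q, w)).
    Treating both bounds as inclusive is an assumption.
    """
    important = []
    for word, (h_w, h_qw) in counts.items():
        r = relevance(h_q, h_w, h_qw)
        if low <= r <= high:
            important.append((word, r))
    return important


# Hypothetical document counts for the query word "digital camera 1".
example_h_q = 20_000
example_counts = {
    "startup": (1_000_000, 1_000),      # about -16.8: kept
    "product": (50_000_000, 15_000),    # about -18.0: too general
    "Manufacturer 1": (30_000, 12_000), # about -10.8: too strongly related
}

for word, r in select_important(example_counts, example_h_q):
    print(f"{word}: {r:.1f}")
```

With these made-up counts only "startup" falls inside the range from -17 to -15; "product" falls below it and "Manufacturer 1" above it, mirroring the two exclusion cases in the text.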

FIG. 5 shows an example of the important co-occurrence word list 1200. The important co-occurrence word list 1200 stores "digital camera 1" as the query word and "startup", "battery", "compact", "lightweight", "image quality", and so on as important co-occurrence words.

Co-occurrence words that are too strongly related are excluded for the following reason. An example is the co-occurrence word "Manufacturer 1" for the query word "digital camera 1" in FIG. 3. Such a co-occurrence word yields only information already known to many users, such as the fact that "Manufacturer 1" is the developer or seller, rather than the information the user actually wants, so it contributes little to presenting useful information to the user.

Conversely, co-occurrence words that are too general are excluded because, for example, the co-occurrence word "product" for the query word "digital camera 1" yields only information that is self-evident to the user, and so likewise fails to present useful information.

The co-occurrence expression collection unit 1300 refers to the important co-occurrence word list 1200 and pairs the query word with each important co-occurrence word. For example, it searches the document set database 400 using "digital camera 1 startup" as keywords. If the predetermined number of documents is 100, it obtains the top 100 documents in ranking order and extracts the passages that contain both the query word "digital camera 1" and the important co-occurrence word "startup". The extracted passage may be the whole sentence containing "digital camera 1" and "startup", a phrase delimited by punctuation marks, or a phrase with "digital camera 1" and "startup" at its two ends. This is done for every important co-occurrence word: the passages containing the query word and the important co-occurrence word are collected as co-occurrence expressions and stored in the co-occurrence expression database 1400.
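The whole-sentence variant of this extraction can be sketched as follows. This is an illustrative sketch under several assumptions: the ranked top-N documents are taken as already retrieved and passed in as `(doc_id, text)` pairs, sentences are split on a small fixed set of Japanese and Western sentence-ending marks, and word matching is plain substring search. The function names and sample documents are hypothetical.

```python
import re

# Sentence-ending marks used for splitting (an assumption; "." will also
# split decimals such as "1.5", which a real system would need to handle).
_SENTENCE = re.compile(r"[^。.!?！？]+[。.!?！？]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def collect_cooccurrence_expressions(documents, query, cooccur_word):
    """Return (doc_id, sentence) pairs whose sentence contains both words.

    documents is an iterable of (doc_id, text) pairs, assumed to be the
    top-ranked results of searching with the query word and the important
    co-occurrence word together.
    """
    results = []
    for doc_id, text in documents:
        for sentence in split_sentences(text):
            if query in sentence and cooccur_word in sentence:
                results.append((doc_id, sentence))
    return results


# Hypothetical documents standing in for search results.
sample_docs = [
    ("doc-1", "The digital camera 1 has a fast startup. The image quality is good."),
    ("doc-2", "I bought the digital camera 1 last week."),
]

for doc_id, sentence in collect_cooccurrence_expressions(
    sample_docs, "digital camera 1", "startup"
):
    print(doc_id, sentence)
```

Keeping the document ID next to each sentence matches the description of the co-occurrence expression database, which stores the ID of the source document together with each expression.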

FIG. 6 shows an example of the co-occurrence expression database 1400. The co-occurrence expression database 1400 also stores the document ID of the document from which each co-occurrence expression was collected. The document ID may be a URL or the like.

The output unit 1500 refers to the co-occurrence expression database 1400 and outputs and displays each important co-occurrence word related to the query word, together with the set of co-occurrence expressions for the query word and that important co-occurrence word.

As described above, with the information search system of this embodiment, for a topic (query word) that the user wants to investigate in a document set containing opinion information, co-occurrence words that are related to the topic but not overly general are acquired on the basis of matching document counts. Sentences and phrases containing both the topic word and a co-occurrence word are exhaustively collected as co-occurrence expressions and presented to the user for each co-occurrence word. The user can thereby get a bird's-eye view of the keywords that serve as axes for the topic and of the opinion information that contains them.

The information search system of the above embodiment is one example of the present invention, and it goes without saying that the present invention is not limited to it.

FIG. 1 is a block diagram showing an information search system according to an embodiment of the present invention.
FIG. 2 is a flowchart explaining the operation of the computer program of the information search system according to an embodiment of the present invention.
FIG. 3 shows an example of the co-occurrence word list according to an embodiment of the present invention.
FIG. 4 shows an example of the matching document count table according to an embodiment of the present invention.
FIG. 5 shows an example of the important co-occurrence word list according to an embodiment of the present invention.
FIG. 6 shows an example of the co-occurrence expression database according to an embodiment of the present invention.

Explanation of symbols

100: information search system; 200: word input unit; 300: matching document search unit; 400: document set database; 500: search document database; 600: co-occurrence word acquisition unit; 700: co-occurrence word list; 800: matching document count acquisition unit; 900: matching document count table; 1000: relevance calculation unit; 1100: important co-occurrence word storage unit; 1200: important co-occurrence word list; 1300: co-occurrence expression collection unit; 1400: co-occurrence expression database; 1500: output unit.

Claims

An information search method for presenting, using a computer device, a co-occurrence expression related to a query word serving as a search keyword from a document set database in which a plurality of documents are stored,
wherein the computer device executes:
a step of searching the documents in the document set database with the query word as the search keyword and extracting, as a co-occurrence word, a word located around the query word in a retrieved document;
a step of obtaining, from the documents in the document set database, the number of documents retrieved by the query word alone, the number of documents retrieved by the co-occurrence word alone, and the number of documents retrieved by both the query word and the co-occurrence word;
a step of calculating the relevance between the query word and the co-occurrence word based on the obtained numbers of documents;
a step of collecting, from the documents in the document set database, co-occurrence expressions, each being at least one of a sentence and a phrase that contains both the query word and a co-occurrence word whose calculated relevance satisfies a predetermined condition; and
a step of presenting the co-occurrence expressions.

The information search method according to claim 1, wherein the computer device executes:
a step of searching the document set database for documents containing the query word and storing the retrieved documents in a search document database in association with the query word;
a step of dividing each document stored in the search document database into words and extracting words located around the query word as co-occurrence words;
a step of registering the extracted co-occurrence words in a co-occurrence word list in association with the query word;
a step of storing, in a matching document count table, the number of matching documents obtained when the document set database is searched with the query word, the number of matching documents obtained when the document set database is searched with each co-occurrence word stored in the co-occurrence word list, and the number of matching documents obtained when the document set database is searched with the pair of the query word and each co-occurrence word corresponding to the query word;
a step of referring to the matching document count table and calculating, for each co-occurrence word, the relevance between the query word and the co-occurrence word;
a step of treating a co-occurrence word whose calculated relevance satisfies a predetermined condition as an important co-occurrence word and storing the important co-occurrence word in an important co-occurrence word list in association with the query word;
a step of searching the document set database with the query word and each important co-occurrence word stored in the important co-occurrence word list and extracting documents containing these words;
a step of extracting, from the extracted documents, at least one of a sentence and a phrase containing both the query word and the important co-occurrence word, and storing it in a co-occurrence expression database as a co-occurrence expression; and
a step of outputting and displaying, for each important co-occurrence word, the co-occurrence expressions stored in the co-occurrence expression database.

The information search method according to claim 1, wherein, when calculating the relevance, the computer device executes a step of calculating, as the relevance, the logarithm of the value obtained by dividing the number of documents containing both the query word and the co-occurrence word by the product of the number of documents containing the query word and the number of documents containing the co-occurrence word.

An information search system that presents a co-occurrence expression related to a query word serving as a search keyword from a document set database in which a plurality of documents are stored, the system comprising:
a document set database in which a plurality of documents are stored;
a search document database that stores documents retrieved from the document set database for an input word;
a co-occurrence word list that stores co-occurrence words that co-occur with the query word in the search documents;
a matching document count table that stores the number of matching documents obtained when the document set database is searched with the query word, the number of matching documents obtained when the document set database is searched with a co-occurrence word, and the number of matching documents obtained when the document set database is searched with the pair of the query word and the co-occurrence word;
an important co-occurrence word list that stores the query word and important co-occurrence words, which are those co-occurrence words that satisfy a predetermined condition;
a co-occurrence expression database that stores at least one of a sentence and a phrase containing both the query word and a co-occurrence word;
a word input unit for inputting a word;
a matching document search unit that takes the word input from the word input unit as the query word, searches the document set database for documents containing the query word to obtain search documents, and stores the search documents in the search document database in association with the query word;
a co-occurrence word acquisition unit that divides each document stored in the search document database into words, takes words located around the query word as co-occurrence words, and registers the co-occurrence words in the co-occurrence word list in association with the query word;
a matching document count acquisition unit that stores, in the matching document count table, the number of matching documents obtained when the document set database is searched with the query word, the number of matching documents obtained when the document set database is searched with each co-occurrence word registered in the co-occurrence word list, and the number of matching documents obtained when the document set database is searched with the pair of the query word and each co-occurrence word corresponding to the query word;
a relevance calculation unit that calculates the relevance between the query word and each co-occurrence word with reference to the matching document count table;
an important co-occurrence word storage unit that treats a co-occurrence word whose relevance to the query word satisfies a predetermined condition as an important co-occurrence word and stores the important co-occurrence word in the important co-occurrence word list in association with the query word;
a co-occurrence expression collection unit that searches the document set database with the query word and each important co-occurrence word stored in the important co-occurrence word list, obtains documents containing these words, extracts from those documents at least one of a sentence and a phrase containing both the query word and the important co-occurrence word, and stores it in the co-occurrence expression database as a co-occurrence expression; and
an output unit that outputs and displays, for each important co-occurrence word, the co-occurrence expressions stored in the co-occurrence expression collection unit.

The information search system according to claim 3, wherein the relevance calculation unit comprises means for calculating, as the relevance, the logarithm of the value obtained by dividing the number of documents containing both the query word and the co-occurrence word by the product of the number of documents containing the query word and the number of documents containing the co-occurrence word.

A computer program for an information search system comprising a computer device that presents a co-occurrence expression related to a query word serving as a search keyword from a document set database in which a plurality of documents are stored,
the computer program comprising the processing steps according to any one of claims 1 to 3.