JP2013045182A - Information retrieval apparatus, method, and program - Google Patents

Information retrieval apparatus, method, and program

Info

Publication number
JP2013045182A
Authority
JP
Japan
Prior art keywords
document
list
expression
specific expression
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2011180966A
Other languages
Japanese (ja)
Other versions
JP5639549B2 (en
Inventor
Naoki Fujita
尚樹 藤田
Yoshihito Yasuda
宜仁 安田
Ryoji Kataoka
良治 片岡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2011180966A
Publication of JP2013045182A
Application granted
Publication of JP5639549B2
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

PROBLEM TO BE SOLVED: To recommend, according to the local area, keywords to add to a search term entered by a user, so that the searcher can reach information easily.
SOLUTION: An information retrieval method includes: a pre-processing step of extracting a document number, named entities, and the frequency of each named entity from a document to generate a named entity list, extracting geographic expressions from the document to generate a place name list containing the place names of geographic zones, determining the co-occurrence frequency or co-occurrence probability with which named entities co-occur to generate a co-occurrence named entity list, and storing the generated lists in storage means; and a recommended word extraction processing step of acquiring a search term and a specified geographic range, determining the zones corresponding to the geographic range, acquiring from the storage means a document list corresponding to those zones, and determining recommended words on the basis of the frequency of the named entities in the document list and the co-occurrence frequency or co-occurrence probability between named entities.

Description

The present invention relates to an information retrieval apparatus, method, and program, and in particular to an apparatus, method, and program that recommend additional words for an input search term in order to shorten the time a searcher needs to obtain the desired information. More specifically, it relates to a service in which the searcher specifies a geographic range, either through the displayed extent of a map or by entering latitude and longitude, and then searches for documents related to that range.

A search engine that searches documents on the Internet can analyze the search history of all users for the search term (search condition) entered by a given searcher and recommend suitable search terms. Patent Document 1 recommends search terms related to the input search term based on the similarity between search terms and their result pages. Non-Patent Document 1 extracts words to add to the search term by analyzing the search terms users entered and the content of the documents they selected.

Non-Patent Document 2 proposes a method that identifies place name information by analyzing the text of a document; its results make it possible to determine which region each document relates to. By analyzing which regions' documents contain each keyword of a keyword set prepared in advance, it should be possible to extract keywords to recommend in a particular region. Concretely, as shown in FIG. 1, the area is divided at fixed intervals, such as 200 m east-west and north-south or every 8 seconds of latitude and longitude (each cell is called a mesh), and the frequency of each keyword in the document set associated with each mesh is analyzed. If, in an area containing multiple meshes, the frequency of a keyword is characteristically high within the overall frequency distribution, that keyword can be judged to be one that should be recommended in the area. Whether a frequency is characteristic can be determined using the standard deviation or the Poisson probability.
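The standard-deviation test mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the data layout (`mesh_counts`), the function name, and the threshold of two standard deviations are all hypothetical and do not come from the cited prior art.

```python
from statistics import mean, pstdev


def characteristic_keywords(mesh_counts, area_meshes, threshold=2.0):
    """Return keywords whose frequency in an area stands out from the per-mesh distribution.

    mesh_counts: dict mesh id -> dict keyword -> frequency (hypothetical layout)
    area_meshes: mesh ids that make up the area of interest
    threshold:   number of standard deviations treated as "characteristic" (assumed)
    """
    keywords = {kw for counts in mesh_counts.values() for kw in counts}
    area = list(area_meshes)
    result = []
    for kw in sorted(keywords):
        # Frequency of this keyword in every mesh of the whole region.
        per_mesh = [counts.get(kw, 0) for counts in mesh_counts.values()]
        mu, sigma = mean(per_mesh), pstdev(per_mesh)
        if sigma == 0:
            continue  # evenly spread keyword: nothing stands out
        area_mean = mean(mesh_counts[m].get(kw, 0) for m in area)
        if (area_mean - mu) / sigma >= threshold:
            result.append(kw)
    return result
```

A keyword that appears with the same frequency in every mesh is never reported, which matches the idea that only locally concentrated keywords are worth recommending.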

Patent Document 1: JP 2011-103020 A

Non-Patent Document 1: Cui, H., Wen, J.-R., Nie, J.-Y., and Ma, W.-Y.: Probabilistic Query Expansion Using Query Logs. In Proceedings of WWW'02, pp. 325-332 (2002).
Non-Patent Document 2: Toru Hirano, et al.: Disambiguation of place names using geographical distance and famousness. IPSJ National Convention, 2008.

Analyzing the search history with the conventional techniques (Patent Document 1, Non-Patent Document 1) makes it possible to extract keywords to add to the search term entered by the searcher, but because the search history contains no geographic relevance information, keywords suited to a location cannot be recommended. Conversely, presenting keywords prepared in advance for a location is considered feasible with the conventional technique (Non-Patent Document 2), but words related to the search term entered by the searcher cannot be presented.

The present invention has been made in view of the above, and its object is to provide an information retrieval apparatus, method, and program that recommend keywords to add to the search term entered by the user according to the region, so that the searcher can reach information easily and the time required for searching is shortened.

To solve the above problem, the present invention (claim 1) is an information retrieval apparatus for determining recommended words according to geography, comprising:
pre-processing means for extracting, from a document, a document number, named entities, and the frequency of each named entity to generate a named entity list, extracting geographic expressions from the document to generate a place name list containing the place names of geographic zones, obtaining the co-occurrence frequency or co-occurrence probability with which named entities co-occur to generate a co-occurrence named entity list, and storing these lists in storage means; and
recommended word extraction processing means for acquiring a search term and a specified geographic range, determining the zones corresponding to the geographic range, acquiring from the storage means a document list corresponding to those zones, and determining recommended words based on the frequency of the named entities in the document list and the co-occurrence frequency or co-occurrence probability between named entities.

The present invention (claim 2) is the information retrieval apparatus of claim 1, wherein
the pre-processing means comprises:
named entity information extraction means for associating a document number with the combinations of a named entity, including place names, extracted from a document and the frequency of that named entity, storing the result as the named entity list in document information storage means, and generating a place name list consisting of the place names among the named entities;
geographic information analysis means for acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
overall co-occurrence frequency calculation means for calculating, for each named entity in the document information storage means, the co-occurrence frequency over all documents and storing it in overall co-occurrence frequency storage means; and
the recommended word extraction processing means comprises:
range information acquisition means for acquiring the search term entered by the searcher and the geographic range of a map;
in-range document acquisition means for extracting, from the geographic document correspondence storage means, the document numbers corresponding to the geographic range;
in-range co-occurrence frequency calculation means for acquiring, from the document information storage means, the named entity lists of the documents corresponding to the document numbers extracted by the in-range document acquisition means, and totaling, for each named entity, the pairs of named entity and frequency in all named entity lists that contain the search term; and
recommended word extraction means for acquiring the combination of the search term and the named entity list, obtaining a Poisson probability from the frequency of each named entity in the named entity list and the co-occurrence frequency over all documents acquired from the overall co-occurrence frequency storage means based on the search term, and extracting the recommended words in descending order of Poisson probability.

The present invention (claim 3) is the information retrieval apparatus of claim 1, wherein
the pre-processing means comprises:
named entity information extraction means for associating a document number with the combinations of a named entity, including place names, extracted from a document and the frequency of that named entity, storing the result as the named entity list in document information storage means, and generating a place name list consisting of the place names among the named entities;
geographic information analysis means for acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
overall co-occurrence frequency calculation means for calculating, for each named entity in the document information storage means, the co-occurrence probability over all documents and storing it in overall co-occurrence probability storage means; and
the recommended word extraction processing means comprises:
range information acquisition means for acquiring the search term entered by the searcher and the geographic range of a map;
in-range document acquisition means for extracting, from the geographic document correspondence storage means, the document numbers corresponding to the geographic range;
in-range co-occurrence probability calculation means for acquiring, from the document information storage means, the named entity lists of the documents corresponding to the document numbers extracted by the in-range document acquisition means, and totaling, for each named entity, the pairs of named entity and co-occurrence probability in all named entity lists that contain the search term; and
recommended word extraction means for acquiring the combination of the search term and the named entity list, and extracting as the recommended words the top N words with the largest difference between the co-occurrence probability of each named entity in the named entity list and the co-occurrence probability over all documents acquired from the overall co-occurrence probability storage means based on the search term.
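The ranking rule in the last element of claim 3 (take the top N words by the gap between in-range and overall co-occurrence probability) can be illustrated with a short sketch. The function and argument names are hypothetical, and two details are assumptions rather than statements from the claim: the gap is taken as in-range minus overall, and ties are broken alphabetically.

```python
def recommend_by_probability_gap(local_prob, global_prob, n):
    """Pick the n entities whose in-range co-occurrence probability most exceeds the overall one.

    local_prob:  entity -> co-occurrence probability with the search term inside the range
    global_prob: entity -> co-occurrence probability with the search term over all documents
    """
    # Assumption: the "difference" is in-range minus overall; a missing overall value counts as 0.
    gaps = {w: p - global_prob.get(w, 0.0) for w, p in local_prob.items()}
    ranked = sorted(gaps, key=lambda w: (-gaps[w], w))  # ties broken alphabetically (assumed)
    return ranked[:n]
```

Words that co-occur with the search term about as often inside the range as everywhere else get a gap near zero and fall to the bottom of the ranking.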

The present invention (claim 4) is an information retrieval method for determining recommended words according to geography, comprising:
a pre-processing step in which pre-processing means extracts, from a document, a document number, named entities, and the frequency of each named entity to generate a named entity list, extracts geographic expressions from the document to generate a place name list containing the place names of geographic zones, obtains the co-occurrence frequency or co-occurrence probability with which named entities co-occur to generate a co-occurrence named entity list, and stores these lists in storage means; and
a recommended word extraction processing step in which recommended word extraction processing means acquires a search term and a specified geographic range, determines the zones corresponding to the geographic range, acquires from the storage means a document list corresponding to those zones, and determines recommended words based on the frequency of the named entities in the document list and the co-occurrence frequency or co-occurrence probability between named entities.

The present invention (claim 5) is the method of claim 4, wherein
the pre-processing step performs:
a named entity information extraction step of associating a document number with the combinations of a named entity, including place names, extracted from a document and the frequency of that named entity, storing the result as the named entity list in document information storage means, and generating a place name list consisting of the place names among the named entities;
a geographic information analysis step of acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
an overall co-occurrence frequency calculation step of calculating, for each named entity in the document information storage means, the co-occurrence frequency over all documents and storing it in overall co-occurrence frequency storage means; and
the recommended word extraction processing step includes:
a range information acquisition step of acquiring the search term entered by the searcher and the geographic range of a map;
an in-range document acquisition step of extracting, from the geographic document correspondence storage means, the document numbers corresponding to the geographic range;
an in-range co-occurrence frequency calculation step of acquiring, from the document information storage means, the named entity lists of the documents corresponding to the document numbers extracted in the in-range document acquisition step, and totaling, for each named entity, the pairs of named entity and frequency in all named entity lists that contain the search term; and
a recommended word extraction step of acquiring the combination of the search term and the named entity list, obtaining a Poisson probability from the frequency of each named entity in the named entity list and the co-occurrence frequency over all documents acquired from the overall co-occurrence frequency storage means based on the search term, and extracting the recommended words in descending order of Poisson probability.

The present invention (claim 6) is the method of claim 4, wherein
the pre-processing step includes:
a named entity information extraction step of associating a document number with the combinations of a named entity, including place names, extracted from a document and the frequency of that named entity, storing the result as the named entity list in document information storage means, and generating a place name list consisting of the place names among the named entities;
a geographic information analysis step of acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
an overall co-occurrence frequency calculation step of calculating, for each named entity in the document information storage means, the co-occurrence probability over all documents and storing it in overall co-occurrence probability storage means; and
the recommended word extraction processing step includes:
a range information acquisition step of acquiring the search term entered by the searcher and the geographic range of a map;
an in-range document acquisition step of extracting, from the geographic document correspondence storage means, the document numbers corresponding to the geographic range;
an in-range co-occurrence probability calculation step of acquiring, from the document information storage means, the named entity lists of the documents corresponding to the document numbers extracted in the in-range document acquisition step, and totaling, for each named entity, the pairs of named entity and co-occurrence probability in all named entity lists that contain the search term; and
a recommended word extraction step of acquiring the combination of the search term and the named entity list, and extracting as the recommended words the top N words with the largest difference between the co-occurrence probability of each named entity in the named entity list and the co-occurrence probability over all documents acquired from the overall co-occurrence probability storage means based on the search term.

The present invention (claim 7) is an information retrieval program for causing a computer to function as each means of the information retrieval apparatus according to any one of claims 1 to 3.

As described above, according to the present invention, keywords to add to the search term entered by the searcher are recommended with geography taken into account, so the searcher can easily obtain the information (recommended words) being sought.

FIG. 1 is a diagram showing region-by-region characteristic word extraction in the prior art.
FIG. 2 is an image of the interface in the first embodiment of the present invention.
FIG. 3 is a configuration diagram of the pre-processing unit in the first embodiment of the present invention.
FIG. 4 is an example of the document information table in the first embodiment of the present invention.
FIG. 5 is an example of the geographic document correspondence table in the first embodiment of the present invention.
FIG. 6 is an example of the overall co-occurrence frequency table in the first embodiment of the present invention.
FIG. 7 is a configuration diagram of the recommended word extraction processing unit in the first embodiment of the present invention.
FIG. 8 is a configuration diagram of the pre-processing unit in the second embodiment of the present invention.
FIG. 9 is a configuration diagram of the recommended word extraction processing unit in the second embodiment of the present invention.
FIG. 10 is an example of the overall co-occurrence probability table in the second embodiment of the present invention.
FIG. 11 is a diagram showing the effect of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
Processing in the present invention is divided into "pre-processing" and "recommended word extraction processing". Pre-processing is performed when documents are collected or when the system is built, and creates the tables required by the recommended word extraction processing.

The recommended word extraction processing extracts recommended words using the tables created by the pre-processing, the search term entered by the searcher, and the range information of the displayed map.

Pre-processing and recommended word extraction processing are described in turn below.

As the concrete interface through which the present invention is offered to the searcher, a service is assumed in which, as shown in FIG. 2, the searcher specifies a geographic range through the displayed extent of a map or by entering latitude and longitude, and then searches for documents related to that range.

First, the pre-processing unit that performs the pre-processing is described.

FIG. 3 shows the configuration of the pre-processing unit in the first embodiment of the present invention.

The pre-processing unit 10A comprises a named entity information extraction unit 11A, a geographic information analysis unit 12A, an overall co-occurrence frequency calculation unit 13A, a document information table 14A, a geographic document correspondence table 15A, and an overall co-occurrence frequency table 16A.

The named entity information extraction unit 11A analyzes the text of each document and extracts named entities, including place names. For extraction, a technique such as that of Non-Patent Document 2 or JP 2010-128774 A is used. Each extracted named entity is paired with its frequency of occurrence in the document and shaped into a named entity list of the form {(named entity: frequency), ...}; a unique document number is assigned to each document in input order, and the combination of document number and named entity list is stored in the document information table 14A shown in FIG. 4. In addition, a place name list containing only the place names among the extracted named entities is created, and the combination of document number and place name list is passed to the geographic information analysis unit 12A.
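The two outputs of this unit, the per-document (named entity: frequency) list and the place name list, can be sketched as below. The extractor itself is outside the scope of this sketch: `toy_extractor` is a hypothetical stand-in based on a tiny hand-made lexicon, not the technique of the cited documents, and starting document numbers at 1 is likewise an assumption.

```python
from collections import Counter


def build_document_table(documents, extract_entities):
    """Build the document information table and the per-document place name lists.

    documents:        list of document texts, in input order
    extract_entities: callable text -> list of (entity, is_place_name) mentions (hypothetical)
    """
    doc_table = {}    # document number -> {entity: frequency}
    place_lists = {}  # document number -> sorted list of place names
    for doc_no, text in enumerate(documents, start=1):  # numbers follow input order
        mentions = extract_entities(text)
        doc_table[doc_no] = dict(Counter(entity for entity, _ in mentions))
        place_lists[doc_no] = sorted({e for e, is_place in mentions if is_place})
    return doc_table, place_lists


def toy_extractor(text):
    """Stand-in extractor: whitespace tokens looked up in a tiny hand-made lexicon."""
    lexicon = {"Yokosuka": True, "Kamakura": True, "curry": False, "temple": False}
    return [(tok, lexicon[tok]) for tok in text.split() if tok in lexicon]
```

For example, `build_document_table(["Yokosuka curry curry"], toy_extractor)` records "curry" with frequency 2 for document 1 and lists "Yokosuka" as its only place name.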

The geographic information analysis unit 12A performs the following processing on each (document number: place name list) combination received from the named entity information extraction unit 11A.

Latitude and longitude information is acquired for each place name in the place name list, for example through the API of an Internet map service (Google (registered trademark)) or from a database maintained by a government body such as the Ministry of Land, Infrastructure, Transport and Tourism.

The number of the mesh that contains each place name's latitude and longitude is then calculated. To assign mesh numbers, the service area is divided into meshes of 8 seconds north-south by 8 seconds east-west (in degrees, minutes, and seconds); in the case of Japan, the mesh at the southwest corner is number 1, and numbers are assigned in sequence along the row from west to east. Once the easternmost mesh has been numbered, numbering continues from the west end of the next row to the north, and this is repeated up to the northernmost row.
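The row-by-row numbering scheme can be written as a small function. This is a sketch only: the southwest corner of the service area (`sw_lat`, `sw_lon`) and the number of meshes per row (`cols`) are parameters introduced here for illustration, and the text does not give their values.

```python
MESH_ARCSEC = 8  # mesh edge length in arc-seconds, as stated in the text


def mesh_number(lat, lon, sw_lat, sw_lon, cols):
    """Return the 1-based mesh number of the point (lat, lon), all values in degrees.

    sw_lat, sw_lon: southwest corner of the service area (assumed parameters)
    cols:           number of meshes in one west-to-east row (assumed parameter)
    """
    row = int((lat - sw_lat) * 3600 // MESH_ARCSEC)  # rows counted northward from the south edge
    col = int((lon - sw_lon) * 3600 // MESH_ARCSEC)  # columns counted eastward from the west edge
    if row < 0 or not 0 <= col < cols:
        raise ValueError("point lies outside the service area")
    return row * cols + col + 1  # southwest mesh is number 1
```

Points that fall exactly on a mesh boundary are sensitive to floating-point rounding; a production implementation would probably work in integer arc-seconds instead.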

The document number is attached to each mesh number obtained and saved in an internal storage area.

When every (document number: place name list) combination has been processed, the internal storage area holds, for each mesh, the corresponding document number list {document number, ...}. Each combination of mesh number and document number list is stored in the geographic document correspondence table 15A shown in FIG. 5.
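Building the mesh-to-documents mapping amounts to inverting the per-document mesh lists. A minimal sketch, assuming the mesh numbers for each document's place names have already been computed; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def build_geo_document_table(doc_meshes):
    """Invert (document number, mesh numbers) pairs into mesh number -> document number list.

    doc_meshes: iterable of (document number, mesh numbers of that document's place names)
    """
    table = defaultdict(list)
    for doc_no, meshes in doc_meshes:
        for mesh in set(meshes):  # list a document only once per mesh
            table[mesh].append(doc_no)
    return {mesh: sorted(docs) for mesh, docs in table.items()}
```

A document whose place names fall in several meshes appears in the list of each of those meshes, while a document with no resolvable place name appears nowhere.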

The overall co-occurrence frequency calculation unit 13A determines, within each document of the document information table 14A, which named entities co-occur, totals the co-occurrence frequencies over all documents, creates for each named entity a co-occurrence named entity list {(named entity: overall co-occurrence frequency), (named entity: overall co-occurrence frequency), ...}, and stores it in the overall co-occurrence frequency table 16A shown in FIG. 6.
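This aggregation can be sketched as follows, reading the overall co-occurrence frequency of w2 with respect to w1 as the sum of w2's in-document frequency over every document that contains w1, which is the definition the text attaches to equation (1). The function name and table layout are hypothetical, and skipping the pairing of an entity with itself is an assumption.

```python
from collections import defaultdict


def total_cooccurrence(doc_table):
    """Compute n(w1, w2) for every pair of entities that share at least one document.

    doc_table: document number -> {entity: frequency}
    returns:   w1 -> {w2: n(w1, w2)}
    """
    totals = defaultdict(lambda: defaultdict(int))
    for entities in doc_table.values():
        for w1 in entities:  # this document belongs to D_w1
            for w2, tf in entities.items():
                if w2 != w1:  # assumption: an entity is not paired with itself
                    totals[w1][w2] += tf
    return {w1: dict(row) for w1, row in totals.items()}
```

Note that the measure is not symmetric: n(w1, w2) counts occurrences of w2, so it generally differs from n(w2, w1).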

Here, the overall co-occurrence frequency n(w1, w2) of named entity w2 with respect to named entity w1 is given by equation (1) below, where D_{w1} is the set of documents containing named entity w1 and tf(w2, d) is the frequency of named entity w2 in document d.

\[ n(w_1, w_2) = \sum_{d \in D_{w_1}} tf(w_2, d) \qquad (1) \]

Next, the recommended word extraction processing unit that performs the recommended word extraction processing is described.

FIG. 7 shows the configuration of the recommended word extraction processing unit in the first embodiment of the present invention.

The recommended word extraction processing unit 20A comprises a mesh number calculation unit 21A, an in-range document number acquisition unit 22A, an in-range co-occurrence frequency calculation unit 23A, a recommended word extraction unit 24A, the geographic document correspondence table 15A, the document information table 14A, and the overall co-occurrence frequency table 16A. Of these, the geographic document correspondence table 15A, the document information table 14A, and the overall co-occurrence frequency table 16A are those created by the pre-processing unit 10A described above.

The mesh number calculation unit 21A acquires the search term and map range information entered by the searcher through the input means, and calculates the mesh numbers contained in that range. The range information is the combination of the latitude and longitude of the southwest corner and of the northeast corner of the map the user is displaying at search time. A range normally contains multiple meshes. The acquired search term and the calculated mesh number list {mesh number, mesh number, ...} are passed to the in-range document number acquisition unit 22A.
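Enumerating the meshes covered by the displayed map could look like the sketch below. It assumes the same numbering as the pre-processing (southwest mesh is 1, numbered west to east, then row by row northward); `area_sw` (southwest corner of the service area) and `cols` (meshes per row) are hypothetical parameters, and the range is not clamped to the service area.

```python
MESH_ARCSEC = 8  # mesh edge length in arc-seconds, as stated in the text


def meshes_in_range(sw, ne, area_sw, cols):
    """List the mesh numbers overlapping the rectangle from sw to ne.

    sw, ne:  (lat, lon) of the displayed map's southwest and northeast corners, in degrees
    area_sw: (lat, lon) of the service area's southwest corner (assumed parameter)
    cols:    number of meshes in one west-to-east row (assumed parameter)
    """
    def index(value, origin):
        return int((value - origin) * 3600 // MESH_ARCSEC)

    row_lo, row_hi = index(sw[0], area_sw[0]), index(ne[0], area_sw[0])
    col_lo, col_hi = index(sw[1], area_sw[1]), index(ne[1], area_sw[1])
    return [row * cols + col + 1
            for row in range(row_lo, row_hi + 1)
            for col in range(col_lo, col_hi + 1)]
```

Because only the two corners are converted, the cost depends on the number of meshes in the view rather than on the number of documents.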

The in-range document number acquisition unit 22A acquires the combination of the search word and the mesh number list, retrieves from the geographic document correspondence table 15A the document number list for each mesh in the mesh number list, merges these into a single document number list without duplicates, and passes it together with the search word to the in-range co-occurrence frequency calculation unit 23A.
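The merge step might look like the following sketch, assuming the geographic document correspondence table is a dictionary from mesh number to a list of document numbers (a hypothetical layout; the function name is also illustrative).

```python
def documents_in_range(mesh_numbers, geo_doc_table):
    """Merge the per-mesh document lists into one list without duplicates.

    First-seen order is preserved so that the result is deterministic.
    """
    seen = set()
    merged = []
    for mesh in mesh_numbers:
        for doc_id in geo_doc_table.get(mesh, []):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Document 2 mentions places in two meshes, so it appears twice in the table.
geo_doc_table = {
    (70, 279): [1, 2],
    (70, 280): [2, 3],
}
doc_ids = documents_in_range([(70, 279), (70, 280), (71, 280)], geo_doc_table)
```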

The in-range co-occurrence frequency calculation unit 23A acquires the combination of the search word and the document number list, and retrieves from the document information table 14A, by document number, the specific expression list for each document in the document number list. For every specific expression list that contains the search word, it totals the elements (specific expression: frequency) per specific expression and creates a single specific expression list holding the frequency totals. The acquired search word and the created specific expression list are passed to the recommended word extraction unit 24A.
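A minimal sketch of this aggregation, again assuming the document information table is a dictionary `{doc_id: {expression: frequency}}` (hypothetical layout and function name):

```python
from collections import Counter


def in_range_frequencies(search_word, doc_ids, doc_table):
    """Total the frequencies of every expression over the in-range documents
    whose expression list contains the search word."""
    totals = Counter()
    for doc_id in doc_ids:
        tf = doc_table.get(doc_id, {})
        if search_word in tf:
            totals.update(tf)  # adds each expression's frequency
    return dict(totals)


docs = {
    1: {"souvenir": 2, "shumai": 3},
    2: {"souvenir": 1, "tuna": 4},
    3: {"shumai": 1, "tuna": 1},
}
in_range = in_range_frequencies("souvenir", [1, 2, 3], docs)
```

Document 3 is skipped because it does not contain the search word. The search word itself stays in the result, which is consistent with the later step excluding it when the total k is formed.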

The recommended word extraction unit 24A acquires the combination of the search word and the specific expression list. For each specific expression in the list, it calculates a Poisson probability from that expression's frequency and from the co-occurrence specific expression list {(specific expression: overall co-occurrence frequency), (specific expression: overall co-occurrence frequency), ...} retrieved from the overall co-occurrence frequency table 16A using the search word as the key. The Poisson probability P(r) is calculated by equation (2) below.

[Equation (2): Poisson probability P(r); image in the original publication]

The variables used in equation (2) are as follows.
・n: the total number of documents
・s: the frequency with which the specific expression co-occurs with the search word across all documents
・k: the total frequency of all specific expressions that co-occur with the search word within the range
・r: the frequency with which the specific expression co-occurs with the search word within the range

In the calculation by the recommended word extraction unit 24A, each variable is obtained as follows.
・n: set in advance (the number of records in the document information table 14A)
・s: the overall co-occurrence frequency of the specific expression in the co-occurrence specific expression list acquired from the overall co-occurrence frequency table 16A
・k: the sum of the frequencies of all specific expressions, excluding the search word, in the specific expression list acquired from the in-range co-occurrence frequency calculation unit 23A
・r: the frequency of the specific expression in the specific expression list acquired from the in-range co-occurrence frequency calculation unit 23A

The 10 words (a fixed number of words) with the highest Poisson probabilities calculated above are output as the recommended word list.
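The ranking step can be sketched as follows. Equation (2) itself is not reproduced in this text, so the sketch makes an explicit assumption: it uses the standard Poisson form P(r) = e^{-λ} λ^r / r! with mean λ = k·s/n. That choice of λ is a guess at how n, s, and k combine, not something the source confirms. All function and parameter names are hypothetical.

```python
import math


def poisson_probability(r, lam):
    """Standard Poisson probability mass: exp(-lam) * lam**r / r!."""
    return math.exp(-lam) * lam ** r / math.factorial(r)


def recommend_by_poisson(search_word, in_range, overall_row, n, top=10):
    """Rank in-range expressions by Poisson probability, highest first.

    in_range:    {expression: in-range frequency}, search word included
    overall_row: {expression: overall co-occurrence frequency with the
                  search word}
    n:           total number of documents
    """
    # k: total frequency of all expressions except the search word
    k = sum(f for w, f in in_range.items() if w != search_word)
    scored = []
    for w, r in in_range.items():
        if w == search_word:
            continue
        s = overall_row.get(w, 0)
        lam = k * s / n  # ASSUMPTION: mean of the Poisson distribution
        scored.append((poisson_probability(r, lam), w))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [w for _, w in scored[:top]]


ranked = recommend_by_poisson(
    "souvenir",
    {"souvenir": 3, "shumai": 3, "tuna": 1},
    {"shumai": 50, "tuna": 25},
    n=100,
)
```

For large frequencies, `lam ** r` and `math.factorial(r)` grow quickly, so a log-space formulation using `math.lgamma` would be more robust.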

[Second Embodiment]
The first embodiment described above uses the co-occurrence frequency, but a co-occurrence probability can be used in its place.

In that case, this can be achieved by replacing the "overall co-occurrence frequency calculation unit 13A" of the pre-processing unit 10A, and the "in-range co-occurrence frequency calculation unit 23A" and "recommended word extraction unit 24A" of the recommended word extraction processing unit 20A, with the "overall co-occurrence probability calculation unit 13B", "in-range co-occurrence probability calculation unit 23B", and "recommended word extraction unit 24B", as shown in FIGS. 8 and 9.

First, the overall co-occurrence probability calculation unit 13B of the pre-processing unit 10B determines whether specific expressions co-occur within each document in the document information table 14B, calculates the probability of co-occurrence, creates for each specific expression a co-occurrence specific expression list {(specific expression: overall co-occurrence probability), (specific expression: overall co-occurrence probability), ...}, and stores it in the overall co-occurrence probability table 16B shown in FIG. 10.

Here, the overall co-occurrence probability p(w1, w2) of specific expression w2 with respect to specific expression w1 is given by equation (3) below, where D_{w1} is the set of documents containing w1 and D_{w1,w2} is the set of documents containing both w1 and w2.

$$ p(w_1, w_2) = \frac{|D_{w_1, w_2}|}{|D_{w_1}|} \quad (3) $$
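The overall co-occurrence probability can be sketched as below. The sketch assumes that equation (3) is the ratio of the number of documents containing both w1 and w2 to the number of documents containing w1, which is what the two document sets defined in the text suggest. The table layout `{doc_id: {expression: frequency}}` and the function name are hypothetical.

```python
from collections import defaultdict


def overall_cooccurrence_probability(doc_table):
    """Return {w1: {w2: p(w1, w2)}}, assuming
    p(w1, w2) = |docs containing w1 and w2| / |docs containing w1|."""
    doc_count = defaultdict(int)                        # |D_w1|
    pair_count = defaultdict(lambda: defaultdict(int))  # |D_w1,w2|
    for tf in doc_table.values():
        words = set(tf)
        for w1 in words:
            doc_count[w1] += 1
            for w2 in words:
                if w2 != w1:
                    pair_count[w1][w2] += 1
    return {
        w1: {w2: c / doc_count[w1] for w2, c in row.items()}
        for w1, row in pair_count.items()
    }


docs = {
    1: {"souvenir": 2, "shumai": 3},
    2: {"souvenir": 1, "tuna": 4},
    3: {"shumai": 1, "tuna": 1},
}
overall_prob = overall_cooccurrence_probability(docs)
```

Unlike the frequency variant, only the presence of an expression in a document matters here; the per-document frequencies are ignored.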
Next, the in-range co-occurrence probability calculation unit 23B of the recommended word extraction processing unit 20B acquires the combination of the search word and the document number list, and retrieves from the document information table 14B the specific expression list for each document in the document number list. Across all specific expression lists that contain the search word, it counts the number of documents in which each specific expression appears, and calculates the in-range co-occurrence probability using the number of documents in the document number list. It then creates a specific expression list {(specific expression: in-range co-occurrence probability), (specific expression: in-range co-occurrence probability), ...} holding the calculated in-range co-occurrence probabilities, and passes it together with the acquired search word to the recommended word extraction unit 24B.
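One possible reading of this step is sketched below. The text says the probability is calculated "using the number of documents in the document number list", so the sketch divides by the size of that list; dividing instead by the number of in-range documents that contain the search word would be an equally plausible reading. The table layout and function name are hypothetical.

```python
def in_range_cooccurrence_probability(search_word, doc_ids, doc_table):
    """Return {expression: in-range co-occurrence probability}.

    ASSUMPTION: probability = (in-range documents that contain both the
    search word and the expression) / (number of in-range documents).
    """
    if not doc_ids:
        return {}
    doc_counts = {}
    for doc_id in doc_ids:
        tf = doc_table.get(doc_id, {})
        if search_word not in tf:
            continue
        for w in tf:
            if w != search_word:
                doc_counts[w] = doc_counts.get(w, 0) + 1
    total = len(doc_ids)
    return {w: c / total for w, c in doc_counts.items()}


docs = {
    1: {"souvenir": 2, "shumai": 3},
    2: {"souvenir": 1, "tuna": 4},
    3: {"shumai": 1, "tuna": 1},
}
in_range_prob = in_range_cooccurrence_probability("souvenir", [1, 3], docs)
```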

The recommended word extraction unit 24B acquires the combination of the search word and the specific expression list. For each specific expression in the list, it extracts characteristic words using that expression's in-range co-occurrence probability and the co-occurrence specific expression list {(specific expression: overall co-occurrence probability), (specific expression: overall co-occurrence probability), ...} retrieved from the overall co-occurrence probability table 16B using the search word as the key. A characteristic word here is a specific expression whose in-range co-occurrence probability differs greatly from its overall co-occurrence probability; the 10 words (a fixed number of words) with the largest differences are output as the recommended word list.
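The difference-based ranking might be sketched as follows. The text does not say whether the difference is signed or absolute; this sketch assumes a signed difference (in-range minus overall), so that expressions over-represented in the selected area rank first. The function and parameter names are hypothetical.

```python
def recommend_by_probability_gap(in_range_prob, overall_prob_row, top=10):
    """Rank expressions by (in-range probability - overall probability).

    in_range_prob:    {expression: in-range co-occurrence probability}
    overall_prob_row: {expression: overall co-occurrence probability with
                       the search word}
    ASSUMPTION: signed difference, largest first.
    """
    gaps = {
        w: p - overall_prob_row.get(w, 0.0)
        for w, p in in_range_prob.items()
    }
    return sorted(gaps, key=lambda w: gaps[w], reverse=True)[:top]


recommended = recommend_by_probability_gap(
    {"shumai": 0.8, "tuna": 0.25, "cake": 0.5},
    {"shumai": 0.25, "tuna": 0.25, "cake": 0.375},
    top=2,
)
```

In this toy data "shumai" co-occurs with the search word far more often inside the range than overall, so it is recommended first, while "tuna" shows no local difference and is dropped.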

As described above, recommended keywords are extracted from the input search word and the per-region co-occurrence frequency or co-occurrence probability, without using the regional distribution of the recommended keywords themselves. As a result, as shown in FIG. 11, for the input search word "souvenir", "shumai" is extracted as a recommended keyword in Yokohama and "tuna" in Misaki, which makes it possible to recommend region-aware words for the search word the user entered.

The operations of the components of the apparatus shown in FIGS. 3, 7, 8, and 9 can be implemented as a program, which can be installed on a computer used as the information retrieval apparatus or distributed via a network.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

10A, 10B Pre-processing unit
11A, 11B Specific expression information extraction unit
12A, 12B Geographic information analysis unit
13A Overall co-occurrence frequency calculation unit
13B Overall co-occurrence probability calculation unit
14A, 14B Document information table
15A, 15B Geographic document correspondence table
16A Overall co-occurrence frequency table
16B Overall co-occurrence probability table
20A, 20B Recommended word extraction processing unit
21A, 21B Mesh number calculation unit
22A, 22B In-range document number acquisition unit
23A In-range co-occurrence frequency calculation unit
23B In-range co-occurrence probability calculation unit
24A, 24B Recommended word extraction unit

Claims (7)

An information retrieval apparatus for determining a search word according to geography (hereinafter referred to as a "recommended word"), comprising:
pre-processing means for extracting, from documents, a document number, specific expressions, and the frequency of each specific expression to generate a specific expression list, extracting geographic expressions from the documents to generate a place name list containing the place names of geographic sections, obtaining the co-occurrence frequency or co-occurrence probability with which the specific expressions co-occur with one another to generate a co-occurrence specific expression list, and storing these lists in storage means; and
recommended word extraction processing means for acquiring a search word and a designated geographic range, determining the sections corresponding to the geographic range, acquiring the document list corresponding to those sections from the storage means, and determining recommended words based on the frequencies of the specific expressions in the document list and the co-occurrence frequency or co-occurrence probability between specific expressions.
The information retrieval apparatus according to claim 1, wherein
the pre-processing means includes:
specific expression information extraction means for associating a document number with each combination of a specific expression, including place names, extracted from a document and the frequency of that specific expression, storing the result in document information storage means as the specific expression list, and generating a place name list consisting of the place names among the specific expressions;
geographic information analysis means for acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
overall co-occurrence frequency calculation means for calculating, for each specific expression in the document information storage means, the co-occurrence frequency over all documents and storing it in overall co-occurrence frequency storage means, and
the recommended word extraction processing means includes:
range information acquisition means for acquiring a search word input by a searcher and the geographic range of a map;
in-range document acquisition means for extracting the document numbers corresponding to the geographic range from the geographic document correspondence storage means;
in-range co-occurrence frequency calculation means for acquiring, from the document information storage means, the specific expression lists appearing in the documents corresponding to the document numbers extracted by the in-range document acquisition means, and totaling, for each specific expression, the pairs of specific expression and frequency in all specific expression lists that contain the search word; and
recommended word extraction means for acquiring the combination of the search word and the specific expression list, obtaining a Poisson probability from the frequency of each specific expression included in the specific expression list and the co-occurrence frequency over all documents of that specific expression acquired from the overall co-occurrence frequency storage means based on the search word, and extracting the recommended words in descending order of the Poisson probability.
The information retrieval apparatus according to claim 1, wherein
the pre-processing means includes:
specific expression information extraction means for associating a document number with each combination of a specific expression, including place names, extracted from a document and the frequency of that specific expression, storing the result in document information storage means as the specific expression list, and generating a place name list consisting of the place names among the specific expressions;
geographic information analysis means for acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
overall co-occurrence frequency calculation means for calculating, for each specific expression in the document information storage means, the co-occurrence probability over all documents and storing it in overall co-occurrence probability storage means, and
the recommended word extraction processing means includes:
range information acquisition means for acquiring a search word input by a searcher and the geographic range of a map;
in-range document acquisition means for extracting the document numbers corresponding to the geographic range from the geographic document correspondence storage means;
in-range co-occurrence probability calculation means for acquiring, from the document information storage means, the specific expression lists appearing in the documents corresponding to the document numbers extracted by the in-range document acquisition means, and totaling, for each specific expression, the pairs of specific expression and co-occurrence probability in all specific expression lists that contain the search word; and
recommended word extraction means for acquiring the combination of the search word and the specific expression list, and extracting, as the recommended words, the top N words with the largest difference between the co-occurrence probability of the specific expression included in the specific expression list and the co-occurrence probability over all documents of that specific expression acquired from the overall co-occurrence probability storage means based on the search word.
An information retrieval method for determining a search word according to geography (hereinafter referred to as a "recommended word"), comprising:
a pre-processing step in which pre-processing means extracts, from documents, a document number, specific expressions, and the frequency of each specific expression to generate a specific expression list, extracts geographic expressions from the documents to generate a place name list containing the place names of geographic sections, obtains the co-occurrence frequency or co-occurrence probability with which the specific expressions co-occur with one another to generate a co-occurrence specific expression list, and stores these lists in storage means; and
a recommended word extraction processing step in which recommended word extraction processing means acquires a search word and a designated geographic range, determines the sections corresponding to the geographic range, acquires the document list corresponding to those sections from the storage means, and determines recommended words based on the frequencies of the specific expressions in the document list and the co-occurrence frequency or co-occurrence probability between specific expressions.
The information retrieval method according to claim 4, wherein
the pre-processing step includes:
a specific expression information extraction step of associating a document number with each combination of a specific expression, including place names, extracted from a document and the frequency of that specific expression, storing the result in document information storage means as the specific expression list, and generating a place name list consisting of the place names among the specific expressions;
a geographic information analysis step of acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
an overall co-occurrence frequency calculation step of calculating, for each specific expression in the document information storage means, the co-occurrence frequency over all documents and storing it in overall co-occurrence frequency storage means, and
the recommended word extraction processing step includes:
a range information acquisition step of acquiring a search word input by a searcher and the geographic range of a map;
an in-range document acquisition step of extracting the document numbers corresponding to the geographic range from the geographic document correspondence storage means;
an in-range co-occurrence frequency calculation step of acquiring, from the document information storage means, the specific expression lists appearing in the documents corresponding to the document numbers extracted in the in-range document acquisition step, and totaling, for each specific expression, the pairs of specific expression and frequency in all specific expression lists that contain the search word; and
a recommended word extraction step of acquiring the combination of the search word and the specific expression list, obtaining a Poisson probability from the frequency of each specific expression included in the specific expression list and the co-occurrence frequency over all documents of that specific expression acquired from the overall co-occurrence frequency storage means based on the search word, and extracting the recommended words in descending order of the Poisson probability.
The information retrieval method according to claim 4, wherein
the pre-processing step includes:
a specific expression information extraction step of associating a document number with each combination of a specific expression, including place names, extracted from a document and the frequency of that specific expression, storing the result in document information storage means as the specific expression list, and generating a place name list consisting of the place names among the specific expressions;
a geographic information analysis step of acquiring latitude and longitude information from the place names in the place name list and storing it, paired with the document number, in geographic document correspondence storage means; and
an overall co-occurrence frequency calculation step of calculating, for each specific expression in the document information storage means, the co-occurrence probability over all documents and storing it in overall co-occurrence probability storage means, and
the recommended word extraction processing step includes:
a range information acquisition step of acquiring a search word input by a searcher and the geographic range of a map;
an in-range document acquisition step of extracting the document numbers corresponding to the geographic range from the geographic document correspondence storage means;
an in-range co-occurrence probability calculation step of acquiring, from the document information storage means, the specific expression lists appearing in the documents corresponding to the document numbers extracted in the in-range document acquisition step, and totaling, for each specific expression, the pairs of specific expression and co-occurrence probability in all specific expression lists that contain the search word; and
a recommended word extraction step of acquiring the combination of the search word and the specific expression list, and extracting, as the recommended words, the top N words with the largest difference between the co-occurrence probability of the specific expression included in the specific expression list and the co-occurrence probability over all documents of that specific expression acquired from the overall co-occurrence probability storage means based on the search word.
An information retrieval program for causing a computer to function as each means of the information retrieval apparatus according to any one of claims 1 to 3.
JP2011180966A 2011-08-22 2011-08-22 Information retrieval apparatus, method, and program Expired - Fee Related JP5639549B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011180966A JP5639549B2 (en) 2011-08-22 2011-08-22 Information retrieval apparatus, method, and program

Publications (2)

Publication Number Publication Date
JP2013045182A (en) 2013-03-04
JP5639549B2 (en) 2014-12-10

Family

ID=48009066

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2011180966A Expired - Fee Related JP5639549B2 (en) 2011-08-22 2011-08-22 Information retrieval apparatus, method, and program

Country Status (1)

Country Link
JP (1) JP5639549B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015230718A (en) * 2014-06-08 2015-12-21 和康 鈴木 Security device
JP2023000855A (en) * 2021-06-18 2023-01-04 ヤフー株式会社 Information processing device, information processing method, and information processing program

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107643835A (en) * 2017-10-19 2018-01-30 北京京东尚科信息技术有限公司 Drop-down word determines method, apparatus, electronic equipment and storage medium
CN108090196B (en) * 2017-12-22 2021-10-15 新奥(中国)燃气投资有限公司 Keyword management method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1125108A (en) * 1997-07-02 1999-01-29 Matsushita Electric Ind Co Ltd Automatic extraction device for relative keyword, document retrieving device and document retrieving system using these devices
JP2008527503A (en) * 2004-12-30 2008-07-24 グーグル インコーポレイテッド Indexing documents according to geographical relevance
JP2009086903A (en) * 2007-09-28 2009-04-23 Nomura Research Institute Ltd Retrieval service device
JP2009245179A (en) * 2008-03-31 2009-10-22 Nomura Research Institute Ltd Document retrieval support device
JP2012089019A (en) * 2010-10-21 2012-05-10 Nippon Telegr & Teleph Corp <Ntt> Document retrieval keyword presentation apparatus and document retrieval keyword presentation program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CSNG201000308255; 藤坂 達也,北山 大輔,李 龍,角谷 和俊: '地域依存検索のための地域特徴語に基づくクエリ生成支援システム' 第1回データ工学と情報マネジメントに関するフォーラム-DEIMフォーラム-論文集 [online] , 20091225, 電子情報通信学会データ工学研究専門委員会 *
JPN6014014773; 藤坂 達也,北山 大輔,李 龍,角谷 和俊: '地域依存検索のための地域特徴語に基づくクエリ生成支援システム' 第1回データ工学と情報マネジメントに関するフォーラム-DEIMフォーラム-論文集 [online] , 20091225, 電子情報通信学会データ工学研究専門委員会 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015230718A (en) * 2014-06-08 2015-12-21 和康 鈴木 Security device
JP2023000855A (en) * 2021-06-18 2023-01-04 ヤフー株式会社 Information processing device, information processing method, and information processing program
JP7453182B2 (en) 2021-06-18 2024-03-19 Lineヤフー株式会社 Information processing device, information processing method, and information processing program

Also Published As

Publication number Publication date
JP5639549B2 (en) 2014-12-10

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20130910

RD02 Notification of acceptance of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7422

Effective date: 20131004

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20140314

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140408

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140609

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20141021

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20141024

R150 Certificate of patent or registration of utility model

Ref document number: 5639549

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

LAPS Cancellation because of no payment of annual fees