JP2013117866A

JP2013117866A - Keyword place name pair extraction device, method and program

Info

Publication number: JP2013117866A
Application number: JP2011265119A
Authority: JP
Inventors: Nobuaki Hiroshima; 伸章廣嶋; Yoshihito Yasuda; 宜仁安田; Norifumi Katabuchi; 典史片渕; Yoshimasa Koike; 義昌小池; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-12-02
Filing date: 2011-12-02
Publication date: 2013-06-13
Anticipated expiration: 2031-12-02
Also published as: JP5583107B2

Abstract

PROBLEM TO BE SOLVED: To avoid failing to extract a pair because its co-occurrence frequency is low, and to extract pairs appropriately without introducing noise even when co-occurrence frequencies are added together. SOLUTION: A keyword place name pair extraction device refers to a geographic database on the basis of an input keyword and place name and acquires the set of partial areas belonging to the place name. For each area in that set, it refers to a frequency database and acquires the overall frequency of the area and the co-occurrence frequency of the area with the keyword. When it determines, on the basis of the overall frequencies and the co-occurrence frequencies, that the keyword is widely distributed across the area represented by the place name, it adds the co-occurrence frequencies of the partial areas together and updates the frequency database. When a set of keyword and place name pairs is input, it searches the frequency database on the basis of that set and extracts keyword and place name pairs other than the input pairs.

Description

The present invention relates to a keyword place name pair extraction device, method, and program, and more particularly to a keyword place name pair extraction device, method, and program for extracting pairs of a specified keyword and a place name. For example, it applies to a technique in which, when “mikan” (mandarin orange) is input as a keyword, the place name “Ehime” corresponding to that keyword is extracted and a map of that place is displayed.

When various things are specified as keywords, it is convenient to be able to find the places related to them. For example, when “gyoza” is specified as a keyword, it would be useful if a map of “Utsunomiya”, a city famous for gyoza, were displayed as a place related to “gyoza”, with information on shops serving gyoza in Utsunomiya shown on the map. To do this, pairs of a keyword and its related place name must be held as data.

As a technique applicable to extracting keyword and place name pairs, a method has been proposed for extracting pairs that stand in some relation, such as a name and an occupation. A small number of pairs is prepared, and a large number of pairs is extracted by repeatedly extracting patterns from pairs and pairs from patterns. Applying this method to keywords and place names makes it possible to extract a large number of pairs from a small number of keyword and place name pairs (see, for example, Non-Patent Document 1).

Pantel, P., Pennacchiotti, M., Espresso: leveraging generic patterns for automatically harvesting semantic relations. COLING-ACL 2006.

However, the method of Non-Patent Document 1 has the problem that a keyword and a place name are not extracted as a pair unless their co-occurrence frequency is high. Consequently, when the corpus used to calculate co-occurrence frequencies is small, the co-occurrence frequencies are low and pairs cannot be extracted. One conceivable solution is to add up the co-occurrence frequencies in the partial areas belonging to the area represented by the place name (for example, the wards belonging to a city) and use the sum as the co-occurrence frequency for that place name. However, when the co-occurrence frequency is high in only some of the partial areas, the summation introduces noise into pair extraction and inappropriate pairs are extracted.

The present invention has been made in view of the above problems. Its object is to provide a keyword place name pair extraction device, method, and program that determine whether a keyword is widely distributed across the area represented by a place name and, if it is, add up the co-occurrence frequencies in the partial areas and use the sum as the co-occurrence frequency between the place name and the keyword. This avoids the failure to extract a pair because of a low co-occurrence frequency, and allows keyword and place name pairs to be extracted appropriately because the summation does not introduce noise.

To solve the above problems, the present invention (Claim 1) is a keyword place name pair extraction device that extracts keyword and place name pairs corresponding to an input set of keyword and place name pairs, comprising:
a geographic database storing place names and, for each place name, the set of partial areas belonging to the area it represents;
a frequency database storing, for each partial area, its overall frequency across all documents and its co-occurrence frequencies with a plurality of keywords;
partial area acquisition means for referring to the geographic database on the basis of an input place name and acquiring the set of partial areas belonging to that place name;
frequency acquisition means for referring to the frequency database for each area in the set of partial areas and acquiring the overall frequency of the area and the co-occurrence frequency of the area with the keyword;
frequency integration means for adding together the co-occurrence frequencies of the partial areas and updating the frequency database when it is determined, on the basis of the overall frequencies and the co-occurrence frequencies with the keyword, that the keyword is widely distributed across the area represented by the place name; and
place name pair extraction means for searching the frequency database on the basis of the input set of keyword and place name pairs and extracting keyword and place name pairs.

The present invention (Claim 2) further comprises distribution similarity score calculation means for calculating a distribution similarity score representing the similarity between the distribution obtained from the overall frequencies in the frequency database and the distribution obtained from the co-occurrence frequencies with the keyword, wherein the frequency integration means determines, on the basis of the distribution similarity score, whether to integrate the frequencies and, if so, integrates the co-occurrence frequencies of the individual areas and updates the frequency database.

In the present invention (Claim 3), the distribution similarity score calculation means uses the KL divergence, whose value becomes smaller as the distributions become more similar, and the frequency integration means determines that the co-occurrence frequencies of the individual areas are to be integrated when the distribution similarity score is smaller than a predetermined value.

According to the present invention, it is determined whether a keyword is widely distributed across the area represented by a place name, and if it is, the co-occurrence frequencies in the partial areas are added up and used as the co-occurrence frequency between the place name and the keyword. This avoids the failure to extract a pair because of a low co-occurrence frequency, and pairs can be extracted appropriately because the summation does not introduce noise.

FIG. 1 is a block diagram of the keyword place name pair extraction device in one embodiment of the present invention.
FIG. 2 is an example of the geographic database in one embodiment of the present invention.
FIG. 3 is an example of the frequency database in one embodiment of the present invention.
FIG. 4 is a flowchart of the operation in one embodiment of the present invention.
FIG. 5 is example (1) of the distribution similarity score in one embodiment of the present invention.
FIG. 6 is example (2) of the distribution similarity score in one embodiment of the present invention.
FIG. 7 is example (3) of the distribution similarity score in one embodiment of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 shows the configuration of a keyword place name pair extraction device 100 according to an embodiment of the present invention. The keyword place name pair extraction device 100 shown in FIG. 1 includes a partial area acquisition unit 1, a frequency acquisition unit 2, a distribution similarity score calculation unit 3, a frequency integration unit 4, a keyword place name pair extraction unit 5, a geographic database 6, and a frequency database 7.

FIG. 2 shows an example of the geographic database 6. As shown in the figure, the geographic database 6 stores place names and, for each place name, the set of partial areas belonging to the area it represents.

FIG. 3 shows an example of the frequency database 7. The frequency database 7 stores, for each partial area, the overall frequency with which the area appears in the documents and the co-occurrence frequency for each keyword related to the area. Co-occurrence frequencies are stored for all n keywords K.
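As a rough illustration of the two databases, the sketch below models them as in-memory Python dictionaries. The names `geo_db`, `freq_db`, and `overall_and_cooc`, and the dictionary layout itself, are assumptions made for illustration and not the storage format of the patent. Only the values for a1 (8000 and 8) are stated directly in the text; the values for a2 to a6 are read off the denominators of the worked KL divergence example in the description, assuming they are listed in area order.

```python
# Hypothetical in-memory stand-ins for geographic database 6 and frequency database 7.

# Place name -> list of partial areas belonging to it.
geo_db = {"A": ["a1", "a2", "a3", "a4", "a5", "a6"]}

# Partial area -> overall (document) frequency and per-keyword co-occurrence frequency.
# Only keyword K1 is filled in; the text gives no per-area counts for K2 or K3.
freq_db = {
    "a1": {"overall": 8000, "cooc": {"K1": 8}},
    "a2": {"overall": 4000, "cooc": {"K1": 6}},
    "a3": {"overall": 2000, "cooc": {"K1": 4}},
    "a4": {"overall": 1000, "cooc": {"K1": 4}},
    "a5": {"overall": 500, "cooc": {"K1": 4}},
    "a6": {"overall": 500, "cooc": {"K1": 4}},
}


def overall_and_cooc(area, keyword):
    """Return (overall frequency, co-occurrence frequency) for one partial area."""
    entry = freq_db[area]
    return entry["overall"], entry["cooc"].get(keyword, 0)


print(overall_and_cooc("a1", "K1"))  # (8000, 8)
```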

The partial area acquisition unit 1 refers to the geographic database 6 for a given place name and acquires the set of partial areas belonging to the area represented by that place name.

The frequency acquisition unit 2 refers to the frequency database 7 for each partial area and acquires its overall frequency and its co-occurrence frequency with the keyword.

The distribution similarity score calculation unit 3 calculates a distribution similarity score representing the similarity between the distribution obtained from the overall frequencies and the distribution obtained from the co-occurrence frequencies with the keyword.

The frequency integration unit 4 determines, on the basis of the distribution similarity score, whether the frequencies should be integrated. If so, it integrates the co-occurrence frequencies of the individual areas into the co-occurrence frequency of the input place name and updates the frequency database 7.

At search time, when a set of keyword and place name pairs is input, the keyword place name pair extraction unit 5 refers to the frequency database 7 and extracts keyword and place name pairs other than the input pairs.

Next, the operation of the keyword place name pair extraction device 100 will be described more specifically.

FIG. 4 is a flowchart of the operation in one embodiment of the present invention.

In the following process, steps 1 to 5 update the frequency database, and step 6 is the search process for an input set of keyword and place name pairs.

Step 1) When a set of place names is input, the partial area acquisition unit 1 refers to the geographic database 6 for each place name and acquires the set of partial areas belonging to the area represented by that place name. Here, it is assumed that “A” is input as the place name. An example of the geographic database 6 is shown in FIG. 2. The partial areas “a1, a2, a3, a4, a5, a6” are obtained for “A”. The method of acquiring partial areas is not limited to this; for example, areas whose address begins with the address of place name “A” (a prefix match) may be treated as its partial areas.
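Step 1 can be sketched as two small lookups: a direct dictionary lookup, and the prefix-match alternative mentioned above. The function names and the address strings in `area_addresses` are hypothetical and chosen only to make the example run; the patent does not specify an address format.

```python
# Hypothetical geographic database: place name -> partial areas.
geo_db = {"A": ["a1", "a2", "a3", "a4", "a5", "a6"]}

# Hypothetical address table used only for the prefix-match variant.
area_addresses = {"a1": "A a1", "a2": "A a2", "b1": "B b1"}


def partial_areas(place_name, geo_db):
    """Look up the partial areas of a place name; unknown names give an empty list."""
    return list(geo_db.get(place_name, []))


def partial_areas_by_prefix(place_address, area_addresses):
    """Alternative: treat every area whose address starts with the place's address as partial."""
    return [
        area
        for area, address in area_addresses.items()
        if address.startswith(place_address) and address != place_address
    ]


print(partial_areas("A", geo_db))                    # ['a1', 'a2', 'a3', 'a4', 'a5', 'a6']
print(partial_areas_by_prefix("A", area_addresses))  # ['a1', 'a2']
```

A plain prefix test is only as good as the address format; with free-form addresses a delimiter-aware comparison would likely be needed.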

Step 2) For each partial area, the frequency acquisition unit 2 refers to the frequency database 7 and acquires the overall frequency with which the partial area appears in the documents and its co-occurrence frequency with the keyword related to the area. In the example of FIG. 3, acquiring the overall frequency and the co-occurrence frequency with keyword K1 for partial area “a1” from the frequency database 7 gives 8000 and 8, respectively. Here, the document frequency is used as the overall frequency. Any overall frequency may be used as long as it reflects the distribution across the areas; for example, the sum of the frequencies of the target keywords in the area may be used instead.

The overall frequency is the frequency over the whole of the data being handled, in contrast to the per-keyword frequencies. Because this embodiment uses the document frequency, if 1000 of the documents being handled relate to a given area (for example, documents whose body text contains the place name), the overall frequency of that area is 1000.
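The document-frequency definition above can be illustrated on a toy corpus. The sketch assumes that a document "relates to" an area when its text contains the area name, as in the example given, and additionally assumes that co-occurrence is counted at document level (area name and keyword in the same document); the text does not state how co-occurrence is counted. The corpus and function names are invented for illustration.

```python
# Toy corpus; each string stands for the body text of one document.
docs = [
    "a1 is known for K1",
    "a1 station guide",
    "K1 shops in a2",
    "weather in a1 and a2",
]


def document_frequency(docs, area):
    """Overall frequency: number of documents whose text contains the area name."""
    return sum(1 for text in docs if area in text)


def cooccurrence_frequency(docs, area, keyword):
    """Assumed definition: number of documents containing both the area name and the keyword."""
    return sum(1 for text in docs if area in text and keyword in text)


print(document_frequency(docs, "a1"))            # 3
print(cooccurrence_frequency(docs, "a1", "K1"))  # 1
```

Substring matching is used only to keep the sketch short; it would confuse names such as "a1" and "a10", so a real corpus would need tokenization or entity matching.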

Step 3) The distribution similarity score calculation unit 3 calculates a distribution similarity score representing the similarity between the distribution obtained from the overall frequencies and the distribution obtained from the co-occurrence frequencies with the keyword. Here, the KL divergence is used as the distribution similarity score between the distribution P obtained from the overall frequencies and the distribution Q obtained from the co-occurrence frequencies with the keyword.

Any distribution similarity score may be used as long as it represents the similarity between the distribution obtained from the overall frequencies and the distribution obtained from the co-occurrence frequencies with the keyword; for example, the JS divergence may be used instead.
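For reference, a minimal sketch of the JS divergence alternative follows. It uses the standard definition (the average of the KL divergences of P and Q from their midpoint M) with natural logarithms. The frequency lists are those of the K1 worked example in this description, assumed to be in area order a1 to a6; the function names are illustrative.

```python
import math


def normalize(counts):
    """Turn raw frequencies into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) with natural logarithm; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when Q has zero entries."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = normalize([8000, 4000, 2000, 1000, 500, 500])  # overall frequencies
q = normalize([8, 6, 4, 4, 4, 4])                  # co-occurrence with K1
print(js_divergence(p, q))
```

Unlike the KL divergence, the JS divergence is symmetric and bounded by log 2, so a threshold chosen for one score would not carry over to the other.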

FIGS. 5 to 7 show examples of distribution similarity scores determined on the basis of the KL divergence. The KL divergence is calculated for keyword K1. Calculating the document distribution P(a1) from the overall frequencies in the frequency database 7 gives
P(a1) = 8000/(8000+4000+2000+1000+500+500)
and likewise for a2 and the other areas. Calculating the keyword distribution Q(a1) gives
Q(a1) = 8/(8+6+4+4+4+4)
and likewise for a2 and the other areas. This gives P(a1) log(P(a1)/Q(a1)) = 0.314 (natural logarithm), and likewise for a2 and the other areas. Summing these terms over a1 to a6 gives a final distribution similarity score of 0.224. Calculating the distribution similarity scores for keywords K2 and K3 in the same way gives 0.520 and 0.044, respectively.
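The K1 calculation can be checked numerically. The sketch below assumes that the terms in the two denominators are the per-area frequencies in the order a1 to a6 and that the logarithm is natural; under those assumptions it reproduces both the 0.314 term for a1 and the final score of 0.224. The function names are illustrative.

```python
import math

# Per-area frequencies for a1..a6, read from the denominators in the worked example.
overall = [8000, 4000, 2000, 1000, 500, 500]  # overall (document) frequency
cooc_k1 = [8, 6, 4, 4, 4, 4]                  # co-occurrence frequency with K1


def normalize(counts):
    """Turn raw frequencies into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), natural logarithm.

    Raises ZeroDivisionError if some Q(i) is 0 while P(i) > 0; real data
    would need smoothing, which the text does not discuss.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = normalize(overall)
q = normalize(cooc_k1)

first_term = p[0] * math.log(p[0] / q[0])
score = kl_divergence(p, q)
print(round(first_term, 3), round(score, 3))  # 0.314 0.224
```

The per-area counts behind the K2 and K3 scores (0.520 and 0.044) are not given in the text, so they are not reproduced here.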

Step 4) The frequency integration unit 4 determines, on the basis of the distribution similarity score, whether the frequencies should be integrated. If it determines that they should be integrated, processing proceeds to step 5; if integration is not needed, processing proceeds to step 6. The KL divergence becomes smaller as the distributions become more similar and takes its minimum value of 0 when the distributions are identical, so here a distribution similarity score of 0.3 or less is taken to mean that the frequencies should be integrated. The determination method is not limited to this; a different threshold or a ratio may be used. Keywords K1 and K3 are determined to require integration, so the co-occurrence frequencies of areas a1 to a6 with K1 and with K3 are integrated into the co-occurrence frequencies of place name A with K1 and with K3, and the frequency database 7 is updated. No integration is performed for keyword K2.
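The step 4 decision amounts to a threshold test on the scores. A minimal sketch, using the three scores and the 0.3 threshold stated in the text (the names `THRESHOLD`, `scores`, and `should_integrate` are illustrative):

```python
# Threshold and scores as stated in the embodiment.
THRESHOLD = 0.3
scores = {"K1": 0.224, "K2": 0.520, "K3": 0.044}


def should_integrate(score, threshold=THRESHOLD):
    """A small KL divergence means the keyword is spread like the documents themselves."""
    return score <= threshold


to_integrate = [keyword for keyword, score in scores.items() if should_integrate(score)]
print(to_integrate)  # ['K1', 'K3']
```

K2 scores 0.520, above the threshold, which is why its per-area frequencies are left as they are.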

Step 5) When step 4 determines that the frequencies should be integrated, the frequency integration unit 4 integrates the co-occurrence frequencies of the individual areas into the co-occurrence frequency of the input place name and updates the frequency database 7.
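The update in step 5 can be sketched as a sum over the partial areas that is written back under the place name. The dictionary layout and the function name `integrate` are assumptions made for illustration; the K1 counts are those of the worked example, assumed to be in area order a1 to a6.

```python
# Hypothetical geographic database and co-occurrence table (K1 only).
geo_db = {"A": ["a1", "a2", "a3", "a4", "a5", "a6"]}
cooc = {
    "a1": {"K1": 8},
    "a2": {"K1": 6},
    "a3": {"K1": 4},
    "a4": {"K1": 4},
    "a5": {"K1": 4},
    "a6": {"K1": 4},
}


def integrate(place, keyword, geo_db, cooc):
    """Sum the keyword's co-occurrence over the place's partial areas and store it for the place."""
    total = sum(cooc[area].get(keyword, 0) for area in geo_db[place])
    cooc.setdefault(place, {})[keyword] = total
    return total


print(integrate("A", "K1", geo_db, cooc))  # 30
print(cooc["A"])                           # {'K1': 30}
```

Whether the sum replaces or is added to any co-occurrence count already stored for the place name itself is not stated in the text; the sketch simply overwrites it.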

Step 6) When a set of keyword and place name pairs is input, the keyword place name pair extraction unit 5 refers to the frequency database 7 and extracts keyword and place name pairs. A method such as Espresso (registered trademark) can be applied to extract the pairs.

Specifically, given a small number of keyword and place name pairs as input, such as
<Udon, Kagawa>
<Beef tongue, Sendai>
<Mikan, Ehime>
a large number of new keyword and place name pairs other than the input, such as
<Takoyaki, Osaka>
<Hitsumabushi, Nagoya>
...
can be acquired.
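The idea of growing a pair set from a few seeds can be illustrated with a single, heavily simplified bootstrapping round: collect the text patterns that connect the seed pairs, then apply those patterns to find new pairs. This is not Espresso itself, which also scores pattern and pair reliability and iterates, and it does not show how the frequency database 7 feeds into extraction, which the text leaves unspecified. The English toy corpus and all function names are invented for illustration.

```python
# Toy corpus: each sentence is assumed to have the form "<keyword><pattern><place>".
corpus = [
    "udon is a specialty of Kagawa",
    "beef tongue is a specialty of Sendai",
    "mikan is grown in Ehime",
    "takoyaki is a specialty of Osaka",
    "hitsumabushi is a specialty of Nagoya",
]
seeds = {("udon", "Kagawa"), ("beef tongue", "Sendai"), ("mikan", "Ehime")}


def extract_patterns(corpus, pairs):
    """Collect the text between keyword and place for sentences that match a known pair."""
    patterns = set()
    for keyword, place in pairs:
        for sentence in corpus:
            if sentence.startswith(keyword) and sentence.endswith(place):
                patterns.add(sentence[len(keyword):len(sentence) - len(place)])
    return patterns


def extract_pairs(corpus, patterns, known):
    """Apply the patterns to every sentence and keep pairs that are not already known."""
    found = set()
    for sentence in corpus:
        for pattern in patterns:
            if pattern in sentence:
                left, right = sentence.split(pattern, 1)
                pair = (left.strip(), right.strip())
                if left and right and pair not in known:
                    found.add(pair)
    return found


patterns = extract_patterns(corpus, seeds)
new_pairs = extract_pairs(corpus, patterns, seeds)
print(sorted(new_pairs))  # [('hitsumabushi', 'Nagoya'), ('takoyaki', 'Osaka')]
```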

In this way, when a keyword is widely distributed across the area represented by a place name, as keywords K1 and K3 are, the co-occurrence frequencies in the partial areas are added up and the frequency database 7 is updated with the sum as the co-occurrence frequency between the place name and the keyword, so the sum can be used when extracting pairs for the set of keyword and place name pairs input at search time. When a keyword co-occurs disproportionately with particular partial areas, as keyword K2 does, no integration is performed, so pairs can be extracted appropriately without generating noise.

The series of operations of the components of the keyword place name pair extraction device shown in FIG. 1 can be built as a program, which can be installed and executed on a computer used as the keyword place name pair extraction device or distributed over a network.

The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

DESCRIPTION OF SYMBOLS
1 Partial area acquisition unit
2 Frequency acquisition unit
3 Distribution similarity score calculation unit
4 Frequency integration unit
5 Keyword place name pair extraction unit
6 Geographic database
7 Frequency database

Claims

1. A keyword place name pair extraction device that extracts keyword and place name pairs corresponding to an input set of keyword and place name pairs, comprising:
a geographic database storing place names and, for each place name, the set of partial areas belonging to the area it represents;
a frequency database storing, for each partial area, its overall frequency across all documents and its co-occurrence frequencies with a plurality of keywords;
partial area acquisition means for referring to the geographic database on the basis of an input place name and acquiring the set of partial areas belonging to that place name;
frequency acquisition means for referring to the frequency database for each area in the set of partial areas and acquiring the overall frequency of the area and the co-occurrence frequency of the area with the keyword;
frequency integration means for adding together the co-occurrence frequencies of the partial areas and updating the frequency database when it is determined, on the basis of the overall frequencies and the co-occurrence frequencies with the keyword, that the keyword is widely distributed across the area represented by the place name; and
place name pair extraction means for searching the frequency database on the basis of the input set of keyword and place name pairs and extracting keyword and place name pairs.

2. The keyword place name pair extraction device according to claim 1, further comprising distribution similarity score calculation means for calculating a distribution similarity score representing the similarity between the distribution obtained from the overall frequencies in the frequency database and the distribution obtained from the co-occurrence frequencies with the keyword,
wherein the frequency integration means determines, on the basis of the distribution similarity score, whether to integrate the frequencies and, if so, integrates the co-occurrence frequencies of the individual areas and updates the frequency database.

3. The keyword place name pair extraction device according to claim 2, wherein
the distribution similarity score calculation means uses the KL (Kullback-Leibler) divergence, whose value becomes smaller as the distributions become more similar, and
the frequency integration means determines that the co-occurrence frequencies of the individual areas are to be integrated when the distribution similarity score is smaller than a predetermined value.

4. A keyword place name pair extraction method for extracting keyword and place name pairs corresponding to an input set of keyword and place name pairs, performed in a device having
a geographic database storing place names and, for each place name, the set of partial areas belonging to the area it represents, and
a frequency database storing, for each partial area, its overall frequency across all documents and its co-occurrence frequencies with a plurality of keywords,
the method comprising:
a partial area acquisition step in which partial area acquisition means refers to the geographic database on the basis of an input place name and acquires the set of partial areas belonging to that place name;
a frequency acquisition step in which frequency acquisition means refers to the frequency database for each area in the set of partial areas and acquires the overall frequency of the area and the co-occurrence frequency of the area with the keyword;
a frequency integration step in which frequency integration means adds together the co-occurrence frequencies of the partial areas and updates the frequency database when it determines, on the basis of the overall frequencies and the co-occurrence frequencies with the keyword, that the keyword is widely distributed across the area represented by the place name; and
a place name pair extraction step in which place name pair extraction means searches the frequency database on the basis of the input set of keyword and place name pairs and extracts keyword and place name pairs.

5. The keyword place name pair extraction method according to claim 4, further comprising a distribution similarity score calculation step in which distribution similarity score calculation means calculates a distribution similarity score representing the similarity between the distribution obtained from the overall frequencies in the frequency database and the distribution obtained from the co-occurrence frequencies with the keyword,
wherein, in the frequency integration step, it is determined on the basis of the distribution similarity score whether to integrate the frequencies and, if so, the co-occurrence frequencies of the individual areas are integrated and the frequency database is updated.

6. The keyword place name pair extraction method according to claim 5, wherein
in the distribution similarity score calculation step, the KL divergence, whose value becomes smaller as the distributions become more similar, is used, and
in the frequency integration step, it is determined that the co-occurrence frequencies of the individual areas are to be integrated when the distribution similarity score is smaller than a predetermined value.

7. A keyword place name pair extraction program that causes a computer to function as each means of the keyword place name pair extraction device according to any one of claims 1 to 3.