JP2010026773A

JP2010026773A - Geographical feature information extraction method and system

Info

Publication number: JP2010026773A
Application number: JP2008187212A
Authority: JP
Inventors: Shinji Ota; 慎司太田
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-07-18
Filing date: 2008-07-18
Publication date: 2010-02-04
Anticipated expiration: 2028-07-18
Also published as: JP5224453B2

Abstract

PROBLEM TO BE SOLVED: To properly extract feature information, similar areas, and similar words related to geographical areas from documents on the Internet.

SOLUTION: A document acquisition unit 11 acquires documents that include an area name, a morphological analysis unit 14 decomposes each document into parts of speech, and a holding unit 15 holds, for each area name, the number of documents in which each word appears. First and second importance calculation units 16 and 17 calculate the first and second importance of each word for each area name. A topic word extraction unit 19 extracts words or word groups whose first importance is high as topic words for each area. An inter-area similarity calculation unit 20 takes the set of second importance values as the feature vector of an area and calculates the inter-area similarity. A similar word extraction unit 22 extracts the words or word groups that are similar between similar areas as similar words.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a geographical feature information extraction method and system, and more particularly to a method and system for extracting feature information about geographical areas from documents (text data) on a network such as the Internet.

Attempts have been made to obtain useful knowledge from documents on the Internet, such as web pages and blogs, using text mining techniques. These techniques include feature word extraction, which extracts characteristic words from documents, and similarity extraction, which examines the similarity between documents.

The TF-IDF method is widely known as a feature word extraction technique. It has many variants, but in every case a score is calculated for each word, defined so that a word scores higher the more often it appears and the more its appearances are concentrated in fewer documents. Words with high scores are then extracted as feature words.

In similarity extraction, a feature vector is generally created from the words contained in each document being compared, and the similarity between documents is obtained by calculating the inner product of, or the distance between, the feature vectors.

Non-Patent Document 1 describes a technique for extracting topic words that are highly relevant to a subject word used as a search keyword. Noting that the Japanese phrase “p の t” (“t of p”, where p is the subject word and t is a topic word) holds in many cases, the technique first sends the character string “p の” to a search engine as a query and extracts the nouns that follow “の” as candidate topic words t. It then ranks the candidates to extract the topic words t most relevant to the subject word p. Specifically, it obtains the proportion of documents containing the subject word p that also contain the topic word t, and the proportion of documents containing the topic word t that also contain the subject word p, and uses the product of the two proportions as the ranking index.

Non-Patent Document 2 describes a feature keyword extraction technique for automatically summarizing regional information on the web, and describes applying feature keyword extraction to fields that handle geographical information, such as GIS (Geographic Information System).

Patent Publications 1 and 2 describe, as a feature word extraction technique, a word importance calculation method that uses feature vectors to measure the importance of a word or word string in a document group. First, the distance d between the word distribution in the partial document set D(T) containing the word T whose importance is to be calculated and the word distribution in the entire document set D0 is calculated. Next, an estimate is calculated of the distance d′ between the entire document set D0 and a subset D, randomly selected from D0, that contains the same number of words as the partial document set D(T). The distances d and d′ are then compared, and the difference between them is taken as the importance of the word.

Patent Publication 3 describes, as a similarity extraction technique, an information classification method that calculates the similarity between each input document and learning documents prepared for each category, and categorizes each document based on that similarity. A learning document is prepared for each category, the similarity between each document and the learning documents is calculated using feature vectors generated in view of the importance of the words obtained from the learning documents, and the documents are categorized according to this similarity.

Patent Publication 1: JP 2001-67362 A
Patent Publication 2: JP 2003-99427 A
Patent Publication 3: JP H11-167581 A
Non-Patent Document 1: Takeshi Noda and 4 others, "Automatic topic word extraction from theme words and Web information retrieval based on this", IPSJ SIG Technical Report 2006-DSB-140 (II), pp. 305-311
Non-Patent Document 2: Ryuichiro Nakato and 1 other, "Feature Keyword Extraction for Automatic Summarization of Web Region Information", DEWS2005 5-C-03 (2005)

The present invention relates in particular to a method and system for automatically extracting feature information about geographical areas from documents on a network such as the Internet. Using conventional feature word extraction techniques for this purpose raises the following problems.

When the TF-IDF method is used as a basis, the score is defined so that a word scores higher the more often it appears and the more its appearances are concentrated in fewer documents, so the number of occurrences of a word contributes heavily to the score. In addition, the more often a word appears within a single document, the higher its score.
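The effect of within-document repetition can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (raw term count multiplied by log(N/df)); the function name `tf_idf` and the toy documents are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document.

    Assumed variant: score = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

# Toy tokenized documents; the second one repeats a word, as spam-like text does.
docs = [
    ["shibuya", "cafe", "music"],
    ["shibuya", "sale", "sale", "sale", "sale"],
    ["shibuya", "fashion", "cafe"],
]
scores = tf_idf(docs)
```

Under this variant, "sale" and "music" each appear in exactly one document, yet "sale" scores four times higher purely because it is repeated inside that document.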

As a result, words that are used frequently in general, such as “... suru.” (“do ...”) and “... iku.” (“go ...”), receive high scores. Words unrelated to the geographical area are therefore extracted as feature information, and it is difficult to exclude them. Moreover, in spam-like blog documents, for example, the same word is often repeated within one document for emphasis, which raises its score. Consequently, even words unrelated to the geographical area, such as words repeated in advertisements, are extracted as feature information.

Thus, when the TF-IDF method is used as a basis, words that are used frequently in general and words that appear frequently within a particular document tend to receive high scores, so words unrelated to the geographical area are extracted as feature information and are difficult to exclude.

The feature word extraction technique of Non-Patent Document 1 analyzes only sentences that contain the phrase “p の t” (“t of p”, with subject word p and topic word t). Restricting the analysis to a specific sentence structure increases the cost and effort of obtaining enough samples for analysis. A technique restricted to a specific sentence structure cannot be applied to extracting feature information from general documents.

Non-Patent Document 2 considers applying feature keyword extraction to fields that handle geographical information, but it clusters the collected set of web pages and uses the TF-IDF method to extract feature keywords from each cluster.

The feature word extraction techniques of Patent Publications 1 and 2 avoid the problems caused by word frequency in the TF-IDF method. However, feature vectors must be generated for every word that appears in the documents being analyzed, so the computational cost grows as the number of words in the documents increases.

The similarity extraction technique of Patent Publication 3 requires learning documents to be prepared in advance. The similarity it calculates is between a learning document and a document to be classified; the similarity between the documents to be classified is not calculated. Furthermore, because feature information about geographical areas changes dynamically, the learning documents must be kept up to date at all times. The similarity between documents can be calculated from the inner product of, or the distance between, feature vectors created from the words contained in each document being compared, but appropriate feature vectors must be created for that purpose.

An object of the present invention is to provide a geographical feature information extraction method and system that can appropriately extract, from documents on a network such as the Internet, feature information related to geographical areas, as well as similar areas and similar words.

To solve the above problems, a first feature of the present invention is that it comprises: a document acquisition unit that uses geographical area names as keys and acquires a plurality of documents containing each area name; a morphological analysis unit that takes all documents acquired by the document acquisition unit as the analysis target document group and decomposes the data of each document in that group into parts of speech; a word appearance document number holding unit that takes the documents acquired for each area name as an analysis target area document group and, for each word appearing in the analysis target document group, holds the number of documents in which the word appears for each analysis target area document group, with reference to the parts of speech obtained by the morphological analysis unit; a first importance calculation unit that, for each word appearing in the analysis target document group, calculates as a first contribution the ratio of the number of documents containing the word to the total number of documents in the analysis target area document group, calculates as a second contribution the ratio, within the analysis target document group, of the number of documents containing the area name among the documents containing the word to the number of documents containing the word, and calculates a first importance of each word for each area name from the first contribution and the second contribution; and a topic word extraction unit that extracts a word or word group with a high first importance as topic words belonging to the area.

A second feature of the present invention is that it further comprises: a second importance calculation unit that, for each word appearing in the analysis target document group, calculates for each area name, as a second importance of the word, the ratio of the number of documents containing the area name among the documents containing the word to the total number of documents in the analysis target document group; an inter-area similarity calculation unit that takes the set of second importance values for each area name as the feature vector of that area and calculates the similarity between the feature vectors of areas as the inter-area similarity; and a similar area extraction unit that extracts similar areas based on the inter-area similarity.

A third feature of the present invention is that the second importance calculation unit has means for calculating, as a third contribution, the ratio of the number of documents containing the word to the total number of documents in the analysis target document group, and means for calculating the product of the second contribution and the third contribution as the second importance.

A fourth feature of the present invention is that it further comprises: a word similarity holding unit that holds, for similar areas, the similarity for each word (each element of the feature vector) obtained while the inter-area similarity calculation unit calculates the inter-area similarity; and a similar word extraction unit that extracts similar words or similar word groups between similar areas based on the per-word similarity.

A fifth feature of the present invention is that it comprises a document filter unit that at least eliminates duplicate documents from each analysis target area document group, and that all of the analysis target area document groups obtained through the document filter unit together are taken as the analysis target document group.

The present invention can be realized not only as a system but also as a method comprising steps that perform the functions of the respective units.

According to the first feature of the present invention, topic words can be extracted for each geographical area. For each word appearing in the analysis target document group, the ratio of the number of documents containing the word to the total number of documents in the analysis target area document group is calculated as the first contribution; the ratio, within the analysis target document group, of the number of documents containing the area name among the documents containing the word to the number of documents containing the word is calculated as the second contribution; and the first importance of each word is calculated for each area name from the two contributions. Because documents are collected with the area name as the key, a large sample of analysis target documents can be gathered and analyzed. Because the first and second contributions are calculated as ratios of document counts and used as indices for topic word extraction, the influence of words that appear repeatedly within a particular document is reduced. Furthermore, because the second contribution represents the relative importance of a word across areas, using it as an index for topic word extraction reduces the influence of words used commonly in all areas.

According to the second and third features, areas with similar topic words can be extracted. For each word appearing in the analysis target document group, the ratio of the number of documents containing the area name among the documents containing the word to the total number of documents in the analysis target document group is calculated for each area name as the second importance of the word. This is equivalent to calculating, as the third contribution, the ratio of the number of documents containing the word to the total number of documents in the analysis target document group, and taking the product of the second contribution and the third contribution as the second importance. The second contribution tends to give high values to words that co-occur strongly with the area, and the third contribution tends to give high values to words used frequently across all documents in the analysis target document group. Therefore, by taking the set of second importance values for each area name as the feature vector of that area, the inter-area similarity can be calculated appropriately while reducing the influence of rarely used words.

According to the fourth feature, the words or word groups that make similar areas similar can be appropriately extracted.

According to the fifth feature, eliminating duplicate documents from the analysis target area document groups reduces the processing load. Excluding document regions unrelated to the area name, and document regions with a high proportion of nouns or unknown words, from the analysis target document group reduces the load further.

The present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the geographical feature information extraction system according to the present invention. The geographical feature information extraction system 10 of this embodiment comprises a document acquisition unit 11, a document holding unit 12, a document filter unit 13, a morphological analysis unit 14, a word appearance document number holding unit 15, a first importance calculation unit 16, a second importance calculation unit 17, a word importance holding unit 18, a topic word extraction unit 19, an inter-area similarity calculation unit 20, a word similarity holding unit 21, and a similar word extraction unit 22. Each of these units can be implemented in hardware or in software. The present invention can also be realized as a method comprising steps that perform the functions of the respective units.

The document acquisition unit 11 accesses servers 2, 3, ... connected to a network 1 such as the Internet and, using a geographical area name (for example, Shibuya or Akihabara) as a key, acquires a plurality of documents (text data) containing that area name. Documents are acquired separately for each area name.

The document holding unit 12 holds the documents acquired by the document acquisition unit 11 for each area name. Hereinafter, the document group acquired for each area name is called an analysis target area document group, and all of these groups together are called the analysis target document group.

The document filter unit 13 excludes documents and text regions that are not to be analyzed from the documents acquired by the document acquisition unit 11. For example, it eliminates identical (duplicate) documents within an analysis target area document group, extracts the text regions of a document in which the area name appears (for example, the one or two sentences before or after a sentence containing the area name), and deletes text regions with a high proportion of nouns or unknown words (a region with an extremely high proportion of nouns is presumed to be a mere listing of personal names, place names, and the like). The documents obtained through the document filter unit 13 form the analysis target document group. Documents acquired under different area names are treated as different documents even if they are identical.
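Two of the filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `dedupe` and `area_window` are hypothetical, documents are assumed to be already split into sentences, duplicates are detected by exact string equality, and the window is applied on both sides of the matching sentence. The noun-ratio filter is omitted because it depends on the morphological analyzer.

```python
def dedupe(docs):
    """Drop exact duplicate documents within one area document group, keeping order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

def area_window(sentences, area_name, before=2, after=2):
    """Keep sentences that mention the area name plus a few neighbours.

    `before` and `after` are assumed window sizes (the text mentions one or
    two sentences); overlapping windows are merged.
    """
    keep = set()
    for i, sentence in enumerate(sentences):
        if area_name in sentence:
            lo = max(0, i - before)
            hi = min(len(sentences), i + after + 1)
            keep.update(range(lo, hi))
    return [sentences[i] for i in sorted(keep)]
```

For example, `area_window(["s0", "s1", "s2 Shibuya", "s3", "s4"], "Shibuya", before=1, after=1)` keeps only the matching sentence and its immediate neighbours.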

The morphological analysis unit 14 decomposes each document in the analysis target document group obtained through the document filter unit 13 into parts of speech. Any method may be used for this decomposition; since such methods are well known, their description is omitted.

The word appearance document number holding unit 15 holds, for each word appearing in the analysis target document group, the number of documents in which the word appears in each analysis target area document group, with reference to the parts of speech obtained by the morphological analysis unit 14.
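A minimal sketch of this per-area document counting is shown below. It assumes the documents have already been tokenized (for example by a morphological analyzer) and that any part-of-speech selection has been applied upstream; the function name `count_document_frequencies` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def count_document_frequencies(area_docs):
    """Count, per area, the number of documents in which each word appears.

    area_docs maps an area name to a list of tokenized documents.
    A word repeated inside one document is counted once for that document.
    """
    counts = defaultdict(Counter)
    for area, docs in area_docs.items():
        for tokens in docs:
            counts[area].update(set(tokens))
    return counts

area_docs = {
    "Shibuya": [["cafe", "cafe", "music"], ["cafe"]],
    "Akihabara": [["anime", "cafe"]],
}
counts = count_document_frequencies(area_docs)
```

Because each document is reduced to a set before counting, repetition within a document does not raise the count.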

The first importance calculation unit 16, for each word appearing in the analysis target document group, calculates as the first contribution the ratio of the number of documents containing the word to the total number of documents in the analysis target area document group, calculates as the second contribution the ratio, within the analysis target document group, of the number of documents containing the area name among the documents containing the word to the number of documents containing the word, and calculates the first importance of each word for each area name from the first contribution and the second contribution.

The second importance calculation unit 17, for each word appearing in the analysis target document group, calculates for each area name, as the second importance of the word, the ratio of the number of documents containing the area name among the documents containing the word to the total number of documents in the analysis target document group.

The word importance holding unit 18 holds the first importance and the second importance in association with each word, for each area name.

The topic word extraction unit 19 ranks the words by the first importance held in the word importance holding unit 18 and extracts the words with high importance as topic words belonging to the area. Presenting them shows the characteristic word (topic word) or word group (topic word group) of each area.

The inter-area similarity calculation unit 20 takes the set of second importance values for each area name as the feature vector of that area and calculates the inter-area similarity from the similarity between the feature vectors of the areas. Similar areas that share similar word groups can be extracted based on the inter-area similarity, and presenting them shows which areas are similar. Similar areas can be extracted by selecting the areas whose similarity exceeds a set value, or a fixed number of areas with the highest similarity.
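The selection of similar areas can be sketched as follows. The text does not fix the similarity measure, so cosine similarity is used here purely as an assumption; the function names, the sparse dict representation of the feature vectors, and the `threshold` and `top_k` parameters are likewise hypothetical.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {word: value} dicts."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_areas(vectors, target, threshold=None, top_k=None):
    """Rank the other areas by similarity to `target`.

    Keeps areas whose similarity exceeds `threshold` and/or the `top_k`
    most similar areas, mirroring the two selection rules in the text.
    """
    ranked = sorted(
        ((cosine_similarity(vectors[target], vec), area)
         for area, vec in vectors.items() if area != target),
        reverse=True,
    )
    if threshold is not None:
        ranked = [(s, a) for s, a in ranked if s > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [(area, score) for score, area in ranked]
```

Either selection rule can be used alone: pass only `threshold` for the set-value rule, or only `top_k` for the fixed-number rule.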

The word similarity holding unit 21 holds, for similar areas, the similarity for each word (the word similarity), where each word is an element of the feature vector; these values are obtained while the inter-area similarity calculation unit 20 calculates the inter-area similarity.

The similar word extraction unit 22 extracts, based on the word similarity, the words or word groups that make areas similar. These words (word groups) can likewise be extracted by selecting the words whose per-word similarity between the similar areas exceeds a set value, or a fixed number of words with the highest per-word similarity between the similar areas. Presenting these words (word groups) shows in which words the similar areas resemble each other.
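One way to obtain a per-word similarity is to split the vector similarity into one term per shared word. The sketch below assumes cosine similarity, so each term is the product of the two normalized vector components; this decomposition, the function names, and the `top_k` parameter are assumptions for illustration and are not stated in the patent.

```python
import math

def word_contributions(vec_a, vec_b):
    """Split the cosine similarity of two area vectors into per-word terms.

    The terms sum to the cosine similarity; words absent from either
    vector contribute nothing and are left out.
    """
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return {}
    shared = vec_a.keys() & vec_b.keys()
    return {w: vec_a[w] * vec_b[w] / (norm_a * norm_b) for w in shared}

def similar_words(vec_a, vec_b, top_k=3):
    """Return up to `top_k` shared words with the largest contribution."""
    terms = word_contributions(vec_a, vec_b)
    return sorted(terms, key=terms.get, reverse=True)[:top_k]
```

A threshold on the per-word term could be used instead of `top_k`, matching the set-value rule described above.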

The processing of the first importance calculation unit 16, the second importance calculation unit 17, the inter-area similarity calculation unit 20, and the similar word extraction unit 22 is described in detail below. The symbols used have the following meanings, and FIG. 2 shows the relationships between the document counts.
ω_k: target word (area name) (1 ≤ k ≤ K, where K is the total number of area names).
D(ω_k): number of documents retrieved with ω_k (total number of documents in the analysis target area document group).
D: number of documents retrieved with all ω_k (total number of documents in the analysis target document group).
e_n: word contained in the documents of the analysis target document group (1 ≤ n ≤ N, where N is the total number of words).
D(e_n,ω_k): number of documents containing e_n among the documents in the analysis target area document group.
D(e_n): number of documents containing e_n among the documents in the analysis target document group.
D(ω_k,e_n): number of documents containing ω_k among the documents containing e_n in the analysis target document group.
S(k,k+1): similarity between areas k and k+1.

The first importance calculation unit 16 calculates the first contribution α(k,n) and the second contribution β(n,k), and from them calculates the importance (first importance) γ(k,n) of each word for each area name.

The first contribution α(k,n) represents the importance of e_n among the documents in the analysis target area document group and, as shown in Equation (1), is calculated as the ratio of D(e_n,ω_k) to D(ω_k).
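The equation itself does not appear in this text. Reconstructed from the verbal definition above, and therefore to be read as a sketch rather than the original typesetting, Equation (1) can be written as:

```latex
\alpha(k,n) = \frac{D(e_n,\omega_k)}{D(\omega_k)} \tag{1}
```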

The second contribution β(n,k) represents the importance of ω_k among the documents containing e_n in the analysis target document group and, as shown in Equation (2), is calculated as the ratio of D(ω_k,e_n) to D(e_n).
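The equation itself does not appear in this text. Reconstructed from the verbal definition above, and therefore to be read as a sketch rather than the original typesetting, Equation (2) can be written as:

```latex
\beta(n,k) = \frac{D(\omega_k,e_n)}{D(e_n)} \tag{2}
```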

The first importance γ(k,n) is calculated as the product of the first contribution α(k,n) and the second contribution β(n,k). Since D(e_n, ω_k) = D(ω_k, e_n), it is expressed by equation (3).
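The equation images are not reproduced in this text. Based only on the verbal definitions above, equations (1) to (3) would presumably read as follows; this is a reconstruction, not a copy of the original figures.

```latex
\begin{align}
\alpha(k,n) &= \frac{D(e_n,\omega_k)}{D(\omega_k)} \tag{1}\\
\beta(n,k)  &= \frac{D(\omega_k,e_n)}{D(e_n)} \tag{2}\\
\gamma(k,n) &= \alpha(k,n)\,\beta(n,k)
             = \frac{D(e_n,\omega_k)^{2}}{D(\omega_k)\,D(e_n)} \tag{3}
\end{align}
```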

Because the first contribution α(k,n) is calculated as a ratio of document counts, its value does not grow even when a particular word appears repeatedly within a particular document, which reduces the influence of such repetition. The second contribution β(n,k) represents the relative importance of a word across the areas. It is likewise calculated as a ratio of document counts, and its value is small for words used commonly in all areas, which reduces the influence of such words. Equation (3) therefore yields an appropriate importance γ(k,n) for each word in each area.
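The following Python sketch shows one way the first importance could be computed from document counts and then used to pick topic words. The function names, the `top_n` cutoff, and the toy counts are assumptions for illustration only; the text above does not fix how many words count as having a "high" first importance.

```python
def first_importance(d_area, d_word_area, d_word):
    """Return {area: {word: gamma(k, n)}} with gamma = alpha * beta."""
    gamma = {}
    for area, counts in d_word_area.items():
        gamma[area] = {}
        for word, count in counts.items():
            alpha = count / d_area[area]  # share of the area's documents containing the word
            beta = count / d_word[word]   # share of the word's documents belonging to the area
            gamma[area][word] = alpha * beta
    return gamma

def topic_words(gamma, area, top_n=2):
    """Pick the top_n words by first importance (top_n is a hypothetical cutoff)."""
    ranked = sorted(gamma[area].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

# Hypothetical document counts for two illustrative areas.
d_area = {"Shibuya": 2, "Kamakura": 3}
d_word_area = {
    "Shibuya": {"fashion": 2, "cafe": 1, "station": 1},
    "Kamakura": {"temple": 3, "cafe": 1, "beach": 1},
}
d_word = {"fashion": 2, "cafe": 2, "station": 1, "temple": 3, "beach": 1}

gamma = first_importance(d_area, d_word_area, d_word)
print(topic_words(gamma, "Shibuya"))
print(topic_words(gamma, "Kamakura"))
```

In this toy data, "cafe" appears in both areas, so its β is halved and it drops below the area-specific words, which is the damping effect described above.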

The second importance calculation unit 17 calculates the importance (second importance) η(k,n) of each word for each area name. As shown in equation (4), for each word appearing in the analysis target document group, the second importance η(k,n) is calculated as the ratio of D(ω_k, e_n), the number of documents that include the area name among the documents containing the word, to D, the total number of documents in the analysis target document group. The set of second importances η(k,n) of all words for a given area name is used as the feature vector of that area.

The second importance η(k,n) calculated by equation (4) equals the product β(n,k) × θ(n) of the second contribution β(n,k) given by equation (2) and the third contribution θ(n) given by equation (5), so it can also be calculated from β(n,k) × θ(n). The third contribution θ(n) represents the importance of the documents containing e_n among the documents retrieved with all ω_k, and is the ratio of D(e_n) to D.
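Equations (4) and (5) are likewise not reproduced in this text. Reconstructed from the verbal definitions only, they would presumably take the following form, with the last line showing why the product form gives the same value.

```latex
\begin{align}
\eta(k,n) &= \frac{D(\omega_k,e_n)}{D} \tag{4}\\
\theta(n) &= \frac{D(e_n)}{D} \tag{5}\\
\beta(n,k)\,\theta(n) &= \frac{D(\omega_k,e_n)}{D(e_n)}\cdot\frac{D(e_n)}{D}
                       = \frac{D(\omega_k,e_n)}{D} = \eta(k,n)
\end{align}
```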

Here, the second contribution β(n,k) gives high values to words that co-occur strongly with the area, and the third contribution θ(n) gives high values to words used frequently across all documents in the analysis target document group. The set of second importances η(k,n) is therefore an effective index (feature vector) for calculating inter-area similarity.
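A minimal Python sketch of the feature vector construction is given below. It builds one vector of η(k,n) values per area over a shared vocabulary and cross-checks it against the product β(n,k) × θ(n). The function name, the fixed vocabulary order, and the toy counts are assumptions for illustration.

```python
import math

def second_importance(d_word_area, d_total, vocabulary):
    """Return {area: [eta(k, n) for each word in vocabulary]}."""
    return {
        area: [counts.get(word, 0) / d_total for word in vocabulary]
        for area, counts in d_word_area.items()
    }

# Hypothetical document counts for two illustrative areas.
d_word_area = {
    "Shibuya": {"fashion": 2, "cafe": 1, "station": 1},
    "Kamakura": {"temple": 3, "cafe": 1, "beach": 1},
}
d_word = {"fashion": 2, "cafe": 2, "station": 1, "temple": 3, "beach": 1}
d_total = 5

# A shared, fixed word order so that vectors of different areas line up.
vocabulary = sorted(d_word)
eta = second_importance(d_word_area, d_total, vocabulary)

# Cross-check against the product form beta(n, k) * theta(n).
for area, counts in d_word_area.items():
    for i, word in enumerate(vocabulary):
        beta = counts.get(word, 0) / d_word[word]
        theta = d_word[word] / d_total
        assert math.isclose(eta[area][i], beta * theta, abs_tol=1e-12)
```

Words that never appear with an area get a zero entry, so every area's vector has the same length N.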

The inter-area similarity calculation unit 20 calculates the similarity S(k,k+1) between areas k and k+1 by equation (6). Equation (6) computes, for every word (1 ≦ n ≦ N), the absolute difference |η(k+1,n) − η(k,n)| between the second importances of the same word n in area k and area k+1, and sums these values to obtain the similarity (distance) between areas k and k+1. In other words, the similarity between the feature vectors (sets of second importances) of the areas is taken as the inter-area similarity S(k,k+1). FIG. 3 illustrates this relationship. The more similar the feature vectors of two areas are, the smaller the value of S(k,k+1) in equation (6).
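Read from the description above, equation (6) amounts to S(k,k+1) = Σ_n |η(k+1,n) − η(k,n)|, that is, a sum of absolute differences between two feature vectors. The Python sketch below applies it to every pair of areas and picks the closest pair. Comparing all pairs rather than only areas k and k+1, and the feature vector values themselves, are assumptions for illustration.

```python
from itertools import combinations

def area_distance(vec_a, vec_b):
    """Sum of absolute differences between two feature vectors (smaller = more similar)."""
    return sum(abs(b - a) for a, b in zip(vec_a, vec_b))

# Hypothetical feature vectors over the vocabulary
# ["beach", "cafe", "fashion", "station", "temple"].
eta = {
    "Shibuya": [0.0, 0.2, 0.4, 0.2, 0.0],
    "Harajuku": [0.0, 0.2, 0.25, 0.1, 0.0],
    "Kamakura": [0.2, 0.2, 0.0, 0.0, 0.6],
}

# Distance for every unordered pair of areas.
distances = {
    (a, b): area_distance(eta[a], eta[b])
    for a, b in combinations(sorted(eta), 2)
}
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

Since a smaller S means more similar areas, extracting similar areas corresponds to taking the pairs with the smallest values.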

For similar areas, the similar word extraction unit 22 extracts similar words or word groups based on the per-word similarity, each word being an element of the feature vector. The per-word similarity has already been obtained as the absolute difference |η(k+1,n) − η(k,n)| of the second importances while the inter-area similarity calculation unit 20 was calculating the inter-area similarity. The similar word extraction unit 22 extracts similar words (word groups) based on these absolute differences, either by selecting the words whose absolute difference is smaller than a set value or by selecting a fixed number of words in ascending order of absolute difference.
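Both selection rules described above (a threshold on the absolute difference, or a fixed number of words with the smallest differences) can be sketched in Python as follows. The function names and sample values are hypothetical. Skipping words that are absent from both areas is an added assumption of this sketch, since such words would otherwise have a difference of zero and always rank first; the text above does not say how they are handled.

```python
def word_differences(vocabulary, vec_a, vec_b):
    """Per-word absolute difference of second importance between two areas."""
    return {
        word: abs(b - a)
        for word, a, b in zip(vocabulary, vec_a, vec_b)
        if a > 0 or b > 0  # assumption: ignore words absent from both areas
    }

def similar_words(diffs, threshold=None, top_m=None):
    """Select words below a threshold and/or the top_m smallest differences."""
    ranked = sorted(diffs.items(), key=lambda kv: (kv[1], kv[0]))
    if threshold is not None:
        ranked = [(word, d) for word, d in ranked if d < threshold]
    if top_m is not None:
        ranked = ranked[:top_m]
    return [word for word, _ in ranked]

# Hypothetical feature vectors for two areas already judged similar.
vocabulary = ["beach", "cafe", "fashion", "station", "temple"]
eta_shibuya = [0.0, 0.2, 0.4, 0.2, 0.0]
eta_harajuku = [0.0, 0.2, 0.25, 0.1, 0.0]

diffs = word_differences(vocabulary, eta_shibuya, eta_harajuku)
print(similar_words(diffs, threshold=0.05))  # words below the set value
print(similar_words(diffs, top_m=2))         # fixed number of closest words
```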

Although an embodiment has been described above, the present invention is not limited to it. For example, the geographic feature information extraction system may include only the configuration for extracting words (topic words) that represent the geographic features of each area from the documents collected by the document acquisition unit. If it additionally includes the inter-area similarity calculation unit and the similar word extraction unit, it can extract similar areas and the words (topic words) that give rise to the similarity between them, and can thereby provide geographic feature information useful to the user.

FIG. 1 is a block diagram showing an embodiment of the geographic feature information extraction system according to the present invention. FIG. 2 is an explanatory diagram showing the relationship among the document counts. FIG. 3 is an explanatory diagram showing the calculation of inter-area similarity.

Explanation of symbols

1: network; 2, 3: servers; 10: geographic feature information extraction system; 11: document acquisition unit; 12: document holding unit; 13: document filter unit; 14: morphological analysis unit; 15: word appearance document number holding unit; 16: first importance calculation unit; 17: second importance calculation unit; 18: word importance holding unit; 19: topic word extraction unit; 20: inter-area similarity calculation unit; 21: word similarity holding unit; 22: similar word extraction unit

Claims

A geographical feature information extraction method comprising:
a first step of acquiring, using a geographical area name as a key, a plurality of documents including the area name;
a second step of setting all documents acquired in the first step as an analysis target document group and decomposing the data of each document of the analysis target document group into parts of speech;
a third step of setting the documents acquired for each area name as an analysis target area document group and, for each word appearing in the analysis target document group, holding, with reference to the parts of speech obtained in the second step, the number of documents in which the word appears for each analysis target area document group;
a fourth step of calculating, for each word appearing in the analysis target document group, the ratio of the number of documents including the word to the total number of documents in the analysis target area document group as a first contribution, calculating the ratio of the number of documents including the area name among the documents including the word to the number of documents including the word as a second contribution, and calculating the first importance of each word for each area name from the first contribution and the second contribution; and
a fifth step of extracting a word or word group having a high first importance as a topic word belonging to the area.

The geographical feature information extraction method according to claim 1, further comprising:
a sixth step of calculating, for each word appearing in the analysis target document group, the ratio of the number of documents including the area name among the documents including the word to the total number of documents in the analysis target document group as the second importance of the word for each area name;
a seventh step of using the set of second importances for each area name as the feature vector of the area and calculating the similarity of the feature vectors between areas as an inter-area similarity; and
an eighth step of extracting similar areas based on the inter-area similarity.

The geographical feature information extraction method according to claim 2, further comprising:
a ninth step of holding, for similar areas, the similarity for each word that is an element of the feature vector, obtained in the process of calculating the inter-area similarity in the seventh step; and
a tenth step of extracting a similar word or similar word group between the similar areas based on the similarity for each word.

A geographical feature information extraction system comprising:
a document acquisition unit that acquires, using a geographical area name as a key, a plurality of documents including the area name;
a morphological analysis unit that sets all documents acquired by the document acquisition unit as an analysis target document group and decomposes the data of each document of the analysis target document group into parts of speech;
a word appearance document number holding unit that sets the documents acquired for each area name as an analysis target area document group and, for each word appearing in the analysis target document group, holds, with reference to the parts of speech obtained by the morphological analysis unit, the number of documents in which the word appears for each analysis target area document group;
a first importance calculation unit that calculates, for each word appearing in the analysis target document group, the ratio of the number of documents including the word to the total number of documents in the analysis target area document group as a first contribution, calculates the ratio of the number of documents including the area name among the documents including the word to the number of documents including the word as a second contribution, and calculates the first importance of each word for each area name from the first contribution and the second contribution; and
a topic word extraction unit that extracts a word or word group having a high first importance as a topic word belonging to the area.

The geographical feature information extraction system according to claim 4, further comprising:
a second importance calculation unit that calculates, for each word appearing in the analysis target document group, the ratio of the number of documents including the area name among the documents including the word to the total number of documents in the analysis target document group as the second importance of the word for each area name;
an inter-area similarity calculation unit that uses the set of second importances for each area name as the feature vector of the area and calculates the similarity of the feature vectors between areas as an inter-area similarity; and
a similar area extraction unit that extracts similar areas based on the inter-area similarity.

The geographical feature information extraction system according to claim 5, wherein the second importance calculation unit comprises means for calculating the ratio of the number of documents including the word to the total number of documents in the analysis target document group as a third contribution, and means for calculating the product of the second contribution and the third contribution as the second importance.

The geographical feature information extraction system according to claim 5 or 6, further comprising:
a word similarity holding unit that holds, for similar areas, the similarity for each word that is an element of the feature vector, obtained in the process of calculating the inter-area similarity by the inter-area similarity calculation unit; and
a similar word extraction unit that extracts a similar word or similar word group between the similar areas based on the similarity for each word.

5. The document of the analysis target area document group is provided with a document filter unit that eliminates at least duplication, and all analysis target area document groups obtained through the document filter unit are set as the analysis target document group. The geographical feature information extraction system according to any one of Items 7 to 7.