JP2016045938A

JP2016045938A - Comment grouping program, server and method extracting area specific comments from many comments

Info

Publication number: JP2016045938A
Application number: JP2015140253A
Authority: JP
Inventors: Kazunori Matsumoto (松本 一則); Yasuhiro Takishima (滝嶋 康弘); Hajime Hattori (服部 元)
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2014-08-20
Filing date: 2015-07-14
Publication date: 2016-04-04
Anticipated expiration: 2035-07-14
Also published as: JP6497657B2

Abstract

PROBLEM TO BE SOLVED: To provide a program and the like that can extract area-specific comments from a large volume of comments in which comments with position information and comments without position information are mixed.

SOLUTION: A comment grouping program comprises: topic grouping means that groups a large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies and assigns a topic number to each comment; position-information comment grouping means that groups the comments with position information by the prescribed area range to which their position information belongs; area-specific topic number extraction means that, for each prescribed area range, extracts the most common topic number among the comments with position information as the area-specific topic number, using a prescribed criterion based on the degree of uneven distribution; and area-specific comment grouping means that groups both the comments with position information and the comments without position information into area-specific comments on the basis of the area-specific topic number of each area range.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for classifying comments posted by an unspecified large number of third parties.

Blog (Web log) servers and mini blog (mini Web log) servers (for example, twitter (registered trademark)) are connected to the Internet. Such a blog server receives comments (posts) from an unspecified large number of third parties and publishes them to other third parties. These comments are written for publication and cover a wide range of topics arising from current events. Many of them concern local topics closely tied to the place of posting, such as traffic jams and train suspensions or delays. Posts also frequently report, in near real time, on local events such as Comic Market or fireworks displays, and on the state of storm and flood damage. Because they can supplement information provided by local governments, government offices, and the mass media, such posts are attracting attention as a medium for grasping detailed on-site conditions.

Conventionally, there is a technique that performs topic classification only on comments (texts) to which position information is attached and extracts area-specific topics and terms (see, for example, Non-Patent Document 1). This technique extracts topics unique to an area from a large number of comments with position information, making it possible to discover various social phenomena.

Non-Patent Document 1: J. Eisenstein, et al., "A Latent Variable Model for Geographic Lexical Variation," EMNLP 2010, [online], [retrieved August 16, 2014], Internet <URL: http://www.cs.cmu.edu/~nasmith/papers/eisenstein+oconnor+smith+xing.emnlp10.pdf>
Non-Patent Document 2: D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, 3:993-1022, 2003, [online], [retrieved August 16, 2014], Internet <URL: http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf>

However, comments with position information account for less than 1% of all comments. If area-specific topics are extracted only from comments sent by this small number of posters, the resulting information may itself be heavily skewed. In fact, the technique described in Non-Patent Document 1 can extract local slang and the names of local baseball teams, but it has difficulty extracting topics arising from current events in the area, such as traffic jams.

In contrast, the inventors of the present application reasoned that comments posted from the same area range deal with similar topics regardless of whether they carry position information. In other words, they considered that area ranges might be distinguishable according to the similarity of topics in comments arising from current events.

The inventors also considered it important to extract only posts on sudden, specific topics that are concentrated in the area range; posts on topics that are constant in that area range were considered less important. In Narita City, for example, posts about overseas travel are constantly numerous, so posts about, say, a suddenly occurring disaster end up buried among them.

Accordingly, an object of the present invention is to provide a comment classification program, server, and method capable of extracting area-specific comments from a large volume of comments in which comments with position information and comments without position information are mixed. A further object of the present invention is to extract only the sudden posts that have arisen in the area range.

According to the present invention, there is provided a comment classification program that causes a computer to function so as to extract area-specific comments from a large number of comments in which comments (texts) with position information and comments (texts) without position information are mixed, the program causing the computer to function as:
topic classification means for classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies and assigning a topic number to each comment;
position-information comment classification means for classifying the plurality of comments with position information by the predetermined area range to which the position information belongs;
area-specific topic number extraction means for extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as an area-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
area-specific comment classification means for classifying comments with position information and comments without position information as area-specific comments based on the area-specific topic number of each area range.

According to another embodiment of the comment classification program of the present invention, it is also preferable that the topic classification means classifies each comment into exactly one topic group using an LDA (Latent Dirichlet Allocation) algorithm, which calculates the likelihood (topic ratio) that the comment belongs to each topic group.

According to another embodiment of the comment classification program of the present invention, it is also preferable that the predetermined criterion based on the degree of uneven distribution used by the area-specific topic number extraction means is the Akaike information criterion.

According to another embodiment of the comment classification program of the present invention, it is also preferable that the program further causes the computer to function as:
period-based comment storage means for storing the area-specific comments for each predetermined period;
frequent word extraction means for extracting, from the area-specific comments associated with each area-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or in at least a predetermined proportion), using a predetermined criterion based on the degree of uneven distribution;
constant topic number extraction means for comparing the frequent words of the current period with the frequent words of a predetermined past period and determining that an area-specific topic number for which at least a predetermined proportion of the frequent words are the same is a constant topic number; and
constant comment deletion means for deleting, from the area-specific comments, the comments associated with a constant topic number.
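The period-to-period comparison of frequent words can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the function names, the dictionary layout of a comment, and the 0.5 threshold are hypothetical, and the frequent words are taken as already extracted (the selection by a criterion based on uneven distribution is not shown).

```python
def overlap_ratio(current_words, past_words):
    """Share of the current period's frequent words that also appeared in the past period."""
    current = set(current_words)
    if not current:
        return 0.0
    return len(current & set(past_words)) / len(current)


def is_constant_topic(current_words, past_words, threshold=0.5):
    """Treat a topic as constant when the overlap ratio reaches the (assumed) threshold."""
    return overlap_ratio(current_words, past_words) >= threshold


def drop_constant_comments(comments, constant_topics):
    """Remove comments whose topic number was judged constant.

    Each comment is assumed to be a dict with a "topic" key (hypothetical layout).
    """
    return [c for c in comments if c["topic"] not in constant_topics]
```

With this shape, a topic whose frequent words barely change between periods (for example, overseas travel around an airport) is filtered out, while a topic with new frequent words is kept.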

According to another embodiment of the comment classification program of the present invention, it is also preferable that the predetermined criterion based on the degree of uneven distribution used by the frequent word extraction means is the Akaike information criterion.

According to another embodiment of the comment classification program of the present invention, it is also preferable that the topic classification means analyzes the predicate-argument structure of the comments with position information and the comments without position information and removes comments that contain hearsay or conjecture expressions.

According to another embodiment of the comment classification program of the present invention, it is also preferable that the program further causes the computer to function as area range clustering means that, when two adjacent area ranges have the same area-specific topic number, clusters the two area ranges and integrates their area-specific comments as those of a single area range.
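One way to picture the merging of adjacent area ranges is a flood fill over a grid. The sketch below is an assumption-laden illustration: the source does not say how area ranges are laid out or what counts as adjacent, so the `(row, col)` grid cells, the 4-neighbour adjacency, and the function name `cluster_area_ranges` are all hypothetical.

```python
from collections import deque


def cluster_area_ranges(cell_topic):
    """Merge 4-adjacent grid cells that share the same area-specific topic number.

    cell_topic maps (row, col) -> topic number (assumed layout).
    Returns a list of (topic, set_of_cells) pairs, one per merged area range.
    """
    seen = set()
    clusters = []
    for start in sorted(cell_topic):
        if start in seen:
            continue
        topic = cell_topic[start]
        members = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            members.add((r, c))
            # Visit the four neighbours that carry the same topic number.
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb not in seen and cell_topic.get(nb) == topic:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append((topic, members))
    return clusters
```

The area-specific comments of all cells in one returned set would then be treated as belonging to a single area range.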

According to another embodiment of the comment classification program of the present invention, it is also preferable that the comments are those posted to a mini blog (mini Web log) server by an unspecified large number of third parties.

According to the present invention, there is provided a comment classification server that collects a large number of comments in which comments (texts) with position information and comments (texts) without position information are mixed and extracts area-specific comments, the server comprising:
topic classification means for classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies and assigning a topic number to each comment;
position-information comment classification means for classifying the plurality of comments with position information by the predetermined area range to which the position information belongs;
area-specific topic number extraction means for extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as an area-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
area-specific comment classification means for classifying comments with position information and comments without position information as area-specific comments based on the area-specific topic number of each area range.

According to another embodiment of the comment classification server of the present invention, it is also preferable that the server further comprises:
period-based comment storage means for storing the area-specific comments for each predetermined period;
frequent word extraction means for extracting, from the area-specific comments associated with each area-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or in at least a predetermined proportion), using a predetermined criterion based on the degree of uneven distribution;
constant topic number extraction means for comparing the frequent words of the current period with the frequent words of a predetermined past period and determining that an area-specific topic number for which at least a predetermined proportion of the frequent words are the same is a constant topic number; and
constant comment deletion means for deleting, from the area-specific comments, the comments associated with a constant topic number.

According to the present invention, there is provided a comment classification method for an apparatus that collects a large number of comments in which comments (texts) with position information and comments (texts) without position information are mixed and extracts area-specific comments, the method comprising:
a first step of classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies and assigning a topic number to each comment;
a second step of classifying the plurality of comments with position information by the predetermined area range to which the position information belongs;
a third step of extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as an area-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
a fourth step of classifying comments with position information and comments without position information as area-specific comments based on the area-specific topic number of each area range.

According to another embodiment of the comment classification method of the present invention, it is also preferable that the method further comprises:
a fifth step of storing the area-specific comments for each predetermined period;
a sixth step of extracting, from the area-specific comments associated with each area-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or in at least a predetermined proportion), using a predetermined criterion based on the degree of uneven distribution;
a seventh step of comparing the frequent words of the current period with the frequent words of a predetermined past period and determining that an area-specific topic number for which at least a predetermined proportion of the frequent words are the same is a constant topic number; and
an eighth step of deleting, from the area-specific comments, the comments associated with a constant topic number.

According to the comment classification program, server, and method of the present invention, area-specific comments can be extracted from a large volume of comments in which comments with position information and comments without position information are mixed. In addition, according to the present invention, it is also possible to extract only the sudden posts that have arisen in the area range.

FIG. 1 is a system configuration diagram of the present invention.
FIG. 2 is a functional configuration diagram of the comment classification server of the present invention.
FIG. 3 is an explanatory diagram showing an example of comments accumulated in the comment accumulation unit.
FIG. 4 is an explanatory diagram showing the classification of comments by the topic classification unit.
FIG. 5 is an explanatory diagram showing comments with position information and area-specific topic numbers.
FIG. 6 is an explanatory diagram showing the comments classified by area-specific topic number.
FIG. 7 is a sequence diagram of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a system configuration diagram of the present invention.

According to FIG. 1, the comment classification server 1 of the present invention is connected to the Internet. The comment classification server 1 collects a large volume of comments from the blog server 2, which publishes comments posted by an unspecified large number of third parties. From a large number of comments in which comments (texts) with position information and comments (texts) without position information are mixed, the comment classification server 1 can classify comments as area-specific comments for each area range.
The blog server 2 is a mini blog server such as a twitter (registered trademark) server.
The terminals 3 are owned by an unspecified large number of third parties, who can freely post comments to the mini blog server 2.
The terminal 4 is operated by a viewer and can access the comment classification server 1 via an access network and the Internet to receive the area-specific comments of a desired area range.

FIG. 2 is a functional configuration diagram of the comment classification server of the present invention.

According to FIG. 2, the comment classification server 1 includes a communication interface unit 10, a comment collection unit 101, and a comment accumulation unit 102.
The comment collection unit 101 receives, from the blog server 2 via the network, comments posted by an unspecified large number of third parties and outputs them to the comment accumulation unit 102. Depending on the intended use, the comment collection unit 101 may collect only comments that contain a plurality of keywords specified in advance.

FIG. 3 is an explanatory diagram showing an example of comments accumulated in the comment accumulation unit.

The comment accumulation unit 102 accumulates the large volume of comments collected by the comment collection unit 101. A "comment" is, for example, a Japanese "tweet" (up to 140 characters) sent via twitter (registered trademark). Each comment contains at least a user ID (from_user_id), a tweet ID (id_str), a posting time (created_at), and the tweet text (texts). A comment may be "with position information" or "without position information". Position information is typically latitude and longitude acquired by GPS (Global Positioning System) on the poster's terminal. It is not limited to this, however, and may be position information based on multi-base-station positioning or position information associated with an access point.

The comment accumulation unit 102 also preferably divides the comments into comment groups by predetermined time range according to their posting time, for example hourly. This makes it possible to track how the area-specific comments change from hour to hour.
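The hourly division can be sketched as follows. This is a minimal illustration only: the field names `id_str` and `created_at` come from the description above, but the dictionary layout, the timestamp format `%Y-%m-%d %H:%M:%S`, and the function name `group_by_hour` are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def group_by_hour(comments):
    """Bucket comments by the hour of their posting time (created_at).

    The timestamp format is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for comment in comments:
        posted = datetime.strptime(comment["created_at"], "%Y-%m-%d %H:%M:%S")
        hour_start = posted.replace(minute=0, second=0)
        buckets[hour_start].append(comment)
    return dict(buckets)
```

Each bucket can then be classified separately, so that the area-specific comments of one hour can be compared with those of the next.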

According to FIG. 2, the comment classification server 1 further includes a morphological analysis unit 110, a topic classification unit 111, a position-information comment classification unit 112, an area-specific topic number extraction unit 113, an area-specific comment classification unit 114, and an area range clustering unit 115. These functional units are realized by executing a program that causes the computer mounted on the server to function. The processing flow of these functional units can also be understood as a comment classification method of the apparatus.

[Morphological analysis unit 110]
The morphological analysis unit 110 divides the many comments in the comment accumulation unit 102 into morphemes by morphological analysis. "Morphological analysis" is a technique that segments text into meaningful words and uses a dictionary to determine their parts of speech and content. A "morpheme" is the smallest meaningful unit among the elements of a text. The resulting morphemes are output to the topic classification unit 111.

[Topic classification unit 111]
The topic classification unit 111 classifies the many comments into a plurality of topic groups based on the distribution of word appearance frequencies and assigns a topic number to each comment.

<LDA analysis by topic classification unit 111>
The topic classification unit 111 classifies each comment into exactly one topic group using the LDA (Latent Dirichlet Allocation) algorithm, which calculates the likelihood (topic ratio) that the comment belongs to each topic group. Topic classification is a probabilistic model of the document generation process expressed as a bag of words, used here to analyze comments (see, for example, Non-Patent Document 2).

LDA introduces a probabilistic framework, based on fluctuations in the word feature vectors, into LSI (Latent Semantic Indexing), a technique that reduces the dimensionality of a word-document matrix. The set of reduced dimensions is called the topics.

The topic classification unit 111 executes processing in the following steps.
(S1) The appearance frequency (number of appearances) of each word across the many comments is input to the LDA process. The appearance frequency of each word is also counted for each comment.
(S2) Next, the word distribution of each topic and the topic ratio of each comment are obtained. Based on the topic ratio, each comment is classified into the topic group to which it belongs. Then, for each topic group, the appearance frequency of each word contained in all of its comments is counted.
(S3) Finally, for each comment, the words belonging to each topic group are counted, and the comment is classified into the topic group with the highest count.
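Step S3 can be sketched on its own, taking the per-topic word sets from S2 as given. This is a hedged illustration rather than the actual LDA procedure: the function name `assign_topic`, the representation of each topic group as a set of words, and the tie-breaking rule are assumptions introduced here.

```python
def assign_topic(comment_words, topic_words):
    """Return the topic number whose word set covers the most words of the comment.

    comment_words: list of morphemes in one comment.
    topic_words: dict mapping topic number -> set of words in that topic group
                 (assumed to be available from step S2).
    Ties go to the smallest topic number (an arbitrary choice for this sketch).
    """
    counts = {
        topic: sum(1 for w in comment_words if w in words)
        for topic, words in topic_words.items()
    }
    return max(sorted(counts), key=lambda t: counts[t])
```

For example, a comment whose morphemes mostly fall in the word set of topic 1 receives topic number 1, whether or not the comment carries position information.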

FIG. 4 is an explanatory diagram showing the classification of comments by the topic classification unit.

According to FIG. 4, the topic classification unit 111 has classified the many comments of FIG. 3 into three topic groups. Each comment is given the topic number of the group into which it was classified.
The comments with position information are then output to the position-information comment classification unit 112 and the area-specific comment classification unit 114.
The comments without position information are output to the area-specific comment classification unit 114.

<Predicate-argument structure analysis by topic classification unit 111>
The topic classification unit 111 also preferably analyzes the predicate-argument structure of the comments with position information and the comments without position information and removes comments that contain hearsay or conjecture expressions.

A "predicate-argument structure" pairs each predicate in a sentence with the noun phrases and the like that serve as its "arguments". Using predicate-argument structure makes it possible to grasp the skeleton of a sentence's meaning. A predicate-argument structure consists of the "object" of a "predicate" and its case. For the analysis, a predicate-argument structure analyzer such as Syncha, which is free software, can be used.

Instead of representing a document as a bag of words (BOW: a set of words), topic classification is performed using predicate-argument structure. A predicate-argument structure such as the following is obtained from each example sentence.
For example, comments are analyzed as follows.
Comment 1: "Helicopters are gathering" (ヘリコプタが集まっている)
-> predicate "gather (situation)", ga-case (nominative) "helicopter"
Comment 2: "Helicopters have been gathering" (ヘリコプタが集まってきている)
-> predicate "gather (situation)", ga-case (nominative) "helicopter"
Comment 3: "I hear helicopters are gathering" (ヘリコプタが集まっているそうだ)
-> predicate "gather (hearsay)", ga-case (nominative) "helicopter"
Comment 4: "Helicopters seem to be gathering" (ヘリコプタが集まっているらしい)
-> predicate "gather (conjecture)", ga-case (nominative) "helicopter"

In the example above, even though the predicate is "gather" in every case, "gather (situation)" in comments 1 and 2 can be distinguished from "gather (hearsay/conjecture)" in comments 3 and 4. With BOW, the existing approach to topic classification, these comments have identical feature values and cannot be analyzed separately. According to the present invention, comments 1 and 2 can be distinguished from comments 3 and 4 in topic classification, and they are assigned different topic numbers.

According to the predicate-argument structure analysis, comments 1 and 2 can be interpreted as situation descriptions posted from within the area range of the scene, whereas comments 3 and 4 can be interpreted as hearsay or conjecture expressions posted from outside that area range. In this case, comments 3 and 4, which contain hearsay or conjecture expressions, are removed. As a result, only comments posted from the area range of the scene (whether with or without position information) are associated with the map.
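The removal step can be sketched as a simple filter over analyzer output. This is only an illustration: the record layout with a `modality` field and the tag names `situation`, `hearsay`, and `conjecture` are assumptions, and they do not reproduce the output format of Syncha or any other real analyzer.

```python
# Modality tags treated as hearsay or conjecture (hypothetical tag names).
HEARSAY_OR_CONJECTURE = {"hearsay", "conjecture"}


def remove_hearsay(comments):
    """Keep only comments whose predicate is tagged as a direct situation description."""
    return [c for c in comments if c["modality"] not in HEARSAY_OR_CONJECTURE]
```

Applied to the four example comments, such a filter would keep comments 1 and 2 and drop comments 3 and 4.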

[Position-information comment classification unit 112]
The position-information comment classification unit 112 classifies the plurality of comments with position information by the predetermined area range to which their position information belongs.

図５は、位置情報付きコメントと地域固有トピック番号と表す説明図である。 FIG. 5 is an explanatory diagram showing comments with position information and region-specific topic numbers.

図５によれば、位置情報付きコメントの位置情報を、地図上にプロットされている。地図は、予め所定地域範囲毎に区分されている。これによって、所定地域範囲毎に、その範囲に含まれる位置情報付きコメントが分類される。 According to FIG. 5, the position information of the comments with position information is plotted on the map. The map is divided in advance for each predetermined area range. As a result, the comments with position information included in the predetermined area range are classified.
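As a minimal sketch of this classification, the following assumes that the area ranges are square latitude/longitude mesh cells. The cell size of 0.01 degrees, the dictionary field names, and the sample coordinates are all illustrative assumptions, not values from this description.

```python
import math
from collections import defaultdict

def mesh_cell(lat, lon, cell_deg=0.01):
    """Map a latitude/longitude pair to the (row, col) index of a square mesh cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def group_by_area(geotagged_comments, cell_deg=0.01):
    """Group comments with position information by the mesh cell that contains them."""
    areas = defaultdict(list)
    for comment in geotagged_comments:
        cell = mesh_cell(comment["lat"], comment["lon"], cell_deg)
        areas[cell].append(comment)
    return dict(areas)

# Hypothetical geotagged comments, each already carrying a topic number.
geotagged = [
    {"id": 1, "lat": 35.7720, "lon": 140.3929, "topic": 17},
    {"id": 2, "lat": 35.7725, "lon": 140.3921, "topic": 17},
    {"id": 3, "lat": 35.6812, "lon": 139.7671, "topic": 4},
]
areas = group_by_area(geotagged)
```

Municipal boundaries could be used instead of a mesh; that would replace `mesh_cell` with a point-in-polygon lookup.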

[Region-specific topic number extraction unit 113]
The region-specific topic number extraction unit 113 extracts, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as the region-specific topic number, using a predetermined criterion based on the degree of uneven distribution. This criterion may be, for example, Akaike's Information Criterion (AIC).

Let the topic numbers of the comments with position information assigned to the area ranges C1, C2, ... be t1, t2, .... The following describes how to calculate the index Info(C, t), which indicates whether topic number t is useful for discriminating area range C.

(S1) The following four frequencies are obtained from the set U of comments included in the area ranges (area range i corresponds to C, and topic number j to t).
n11 = number of comments with topic number j in area range i
n12 = number of comments with a topic number other than j in area range i
n21 = number of comments with topic number j outside area range i
n22 = number of comments with a topic number other than j outside area range i

(S2) Next, for n11, n12, n21, and n22, the value MLL_IM(C, t) for the independent model and the value MLL_DM(C, t) for the dependent model are calculated for use with the Akaike information criterion. These values are used to obtain the degree of disproportion for each pair of area range and topic number.
MLL_IM(C, t) = (n11 + n12) log(n11 + n12)
  + (n11 + n21) log(n11 + n21)
  + (n21 + n22) log(n21 + n22)
  + (n12 + n22) log(n12 + n22) - 2N log N
MLL_DM(C, t) = n11 log n11 + n12 log n12 + n21 log n21 + n22 log n22 - N log N
where N = n11 + n12 + n21 + n22

(S3) From MLL_IM(C, t) and MLL_DM(C, t), Info(C, t) is calculated as follows.
AIC_IM(C, t) = -2 × MLL_IM(C, t) + 2 × 2
AIC_DM(C, t) = -2 × MLL_DM(C, t) + 2 × 3
Info(C, t) = AIC_IM(C, t) - AIC_DM(C, t)

Info(C, t) calculated above represents the degree to which topic number t appears disproportionately in area range C. Under the Akaike information criterion, the more useful topic number t is for discriminating area range C, the larger the value of Info(C, t). According to the present invention, the m topic numbers t with the largest values of Info(C, t) can be extracted for each area range Ci.
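Steps (S1) to (S3) can be sketched as follows. The function names, the `counts` layout, and the sample numbers are hypothetical. The sketch uses natural logarithms and treats 0 log 0 as 0, neither of which is stated in this description. It also skips topics that are rarer in the area range than elsewhere; this check is an added assumption, because Info(C, t) by itself is large for any strong dependence, including under-representation.

```python
import math

def xlogx(x):
    """Return x * log(x), using the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def info(n11, n12, n21, n22):
    """Info = AIC_IM - AIC_DM for one 2x2 table of counts (steps S2 and S3)."""
    n = n11 + n12 + n21 + n22
    mll_im = (xlogx(n11 + n12) + xlogx(n11 + n21)
              + xlogx(n21 + n22) + xlogx(n12 + n22) - 2 * xlogx(n))
    mll_dm = xlogx(n11) + xlogx(n12) + xlogx(n21) + xlogx(n22) - xlogx(n)
    aic_im = -2 * mll_im + 2 * 2
    aic_dm = -2 * mll_dm + 2 * 3
    return aic_im - aic_dm

def top_topics(area, counts, m=1):
    """counts[(area, topic)] is the number of geotagged comments; return the top-m topics."""
    total = sum(counts.values())
    area_total = sum(v for (a, _), v in counts.items() if a == area)
    scored = []
    for topic in sorted({t for (_, t) in counts}):
        topic_total = sum(v for (_, t), v in counts.items() if t == topic)
        n11 = counts.get((area, topic), 0)   # step S1: the four frequencies
        n12 = area_total - n11
        n21 = topic_total - n11
        n22 = total - area_total - n21
        # Assumption (not in the source): ignore topics under-represented in this area.
        if n11 * total <= area_total * topic_total:
            continue
        scored.append((info(n11, n12, n21, n22), topic))
    scored.sort(reverse=True)
    return [topic for _, topic in scored[:m]]

# Hypothetical counts for two area ranges and three topics.
counts = {("a", 1): 30, ("a", 2): 5, ("a", 3): 5,
          ("b", 1): 5, ("b", 2): 30, ("b", 3): 25}
best_a = top_topics("a", counts)
best_b = top_topics("b", counts, m=2)
```

For a table with no association at all, such as `info(1, 1, 1, 1)`, the two MLL values coincide and Info reduces to the parameter penalty difference of -2, so positive values indicate that the dependent model is preferred.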

[Region-specific comment classification unit 114]
The region-specific comment classification unit 114 classifies the comments with position information and the comments without position information as region-specific comments, based on the region-specific topic number of each area range.

FIG. 6 is an explanatory diagram showing the comments classified by region-specific topic number.

In FIG. 6, suppose for example that topic number 1 is the most frequent among the comments with position information located in area range a. In this case, comments with topic number 1 can be presumed to have been posted from within area range a. Accordingly, both the comments with position information and the comments without position information that have topic number 1 can be classified as region-specific comments of area range a.
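A minimal sketch of this assignment is shown below. The field names and sample data are hypothetical. One point is an assumption of the sketch rather than something stated here: a comment that carries position information is assigned only to the area range it was actually posted from, while a comment without position information is assigned purely by topic number.

```python
def classify_region_specific(comments, area_topic):
    """area_topic maps each area range to its region-specific topic number.

    Returns a dict: area range -> ids of the comments classified as region-specific.
    """
    grouped = {area: [] for area in area_topic}
    for c in comments:
        for area, topic in area_topic.items():
            if c["topic"] != topic:
                continue
            # c["area"] is None for comments without position information.
            if c["area"] is None or c["area"] == area:
                grouped[area].append(c["id"])
    return grouped

area_topic = {"a": 1, "b": 2}  # hypothetical result of the extraction unit
comments = [
    {"id": 1, "area": "a", "topic": 1},
    {"id": 2, "area": None, "topic": 1},   # no position information, topic 1
    {"id": 3, "area": "b", "topic": 2},
    {"id": 4, "area": None, "topic": 3},   # topic not specific to any area
    {"id": 5, "area": "b", "topic": 1},    # geotagged elsewhere, so not moved to "a"
]
grouped = classify_region_specific(comments, area_topic)
```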

[Area range clustering unit 115]
When the region-specific topic numbers of two adjacent area ranges are the same, the area range clustering unit 115 clusters the two area ranges and integrates their region-specific comments as those of a single area range.

An area range may be a mesh section based on latitude and longitude, or a municipal area based on place names. These area ranges are geographically hierarchical, and lower-level area ranges can be merged into higher-level area ranges by clustering (lattice structure). When the Akaike information criterion described above is used, the frequencies n11, n12, n21, and n22 of the lower-level area ranges have already been calculated. Therefore, even when area ranges are merged, the topic numbers of the higher-level area range can be calculated simply by combining those frequencies.
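The merging of adjacent area ranges can be sketched with a union-find structure over mesh cells. This is one possible realization under stated assumptions: area ranges are (row, col) mesh cells, adjacency means sharing an edge, and each cell has a single region-specific topic number. None of these choices is prescribed by this description.

```python
def cluster_areas(area_topic):
    """area_topic maps a (row, col) mesh cell to its region-specific topic number.

    Returns a list of clusters, each a set of edge-adjacent cells sharing one topic.
    """
    parent = {cell: cell for cell in area_topic}

    def find(cell):
        # Follow parent links to the cluster representative, halving the path.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for (r, c), topic in area_topic.items():
        # Checking the lower and right neighbours covers every adjacent pair once.
        for neighbour in ((r + 1, c), (r, c + 1)):
            if area_topic.get(neighbour) == topic:
                parent[find(neighbour)] = find((r, c))

    clusters = {}
    for cell in area_topic:
        clusters.setdefault(find(cell), set()).add(cell)
    return list(clusters.values())

# Hypothetical cells: three connected cells with topic 1, one with topic 2, one isolated.
cells = {(0, 0): 1, (0, 1): 1, (1, 1): 1, (1, 0): 2, (3, 3): 1}
clusters = cluster_areas(cells)
```

Once a cluster is formed, its per-topic comment counts are the sums of the member cells' counts, so the index for the merged area range can be recomputed without revisiting the individual comments.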

As shown in FIG. 2, the comment classification server 1 further includes a period-specific comment storage unit 116, a frequent word extraction unit 117, a constant topic number extraction unit 118, and a constant comment deletion unit 119. These functional components are also realized by executing a program that causes a computer mounted on the server to function. The processing flow of these functional components can also be understood as a comment classification method of the apparatus.

[Period-specific comment storage unit 116]
The period-specific comment storage unit 116 stores the region-specific comments for each predetermined period. The region-specific comments based on each region-specific topic number of each area range are stored per predetermined period, for example per day.

[Frequent word extraction unit 117]
The frequent word extraction unit 117 extracts, from the region-specific comments based on each region-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or at a predetermined ratio), using a predetermined criterion based on the degree of uneven distribution. This criterion may be the Akaike information criterion. The following describes how to calculate the index Info(C, t, w), which indicates whether word w is useful for discriminating area range C and topic number t.

(S1) For a word w that appears disproportionately in the posts of area range Ci and topic number j, the following four frequencies are obtained.
m11 = number of posts containing word w among the posts with topic number j in area range Ci
m12 = number of posts containing word w among the posts with a topic number other than j in area range Ci
m21 = number of posts not containing word w among the posts with topic number j in area range Ci
m22 = number of posts not containing word w among the posts with a topic number other than j in area range Ci
These frequencies are calculated for each predetermined period, for example daily.

(S2) For m11, m12, m21, and m22, the value MLL_IM(C, t, w) for the independent model and the value MLL_DM(C, t, w) for the dependent model are calculated for use with the Akaike information criterion. These values are used to obtain the degree of disproportion for each pair of area range and topic number.
MLL_IM(C, t, w) = (m11 + m12) log(m11 + m12)
  + (m11 + m21) log(m11 + m21)
  + (m21 + m22) log(m21 + m22)
  + (m12 + m22) log(m12 + m22) - 2M log M
MLL_DM(C, t, w) = m11 log m11 + m12 log m12 + m21 log m21 + m22 log m22 - M log M
where M = m11 + m12 + m21 + m22

(S3) From MLL_IM(C, t, w) and MLL_DM(C, t, w), Info(C, t, w) is calculated as follows.
AIC_IM(C, t, w) = -2 × MLL_IM(C, t, w) + 2 × 2
AIC_DM(C, t, w) = -2 × MLL_DM(C, t, w) + 2 × 3
Info(C, t, w) = AIC_IM(C, t, w) - AIC_DM(C, t, w)

Info(C, t, w) calculated above represents the degree to which word w appears disproportionately in the posts with topic number t in area range C. Under the Akaike information criterion, the more useful word w is for discriminating area range C and topic number t, the larger the value of Info(C, t, w). According to the present invention, the l words w with the largest values of Info(C, t, w) can be extracted for each area range Ci and topic number t.
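The four word frequencies of step (S1) can be tallied as in the sketch below, assuming each post of one area range and one period has already been reduced to a topic number and a set of words. The data layout and the sample posts are hypothetical. The resulting table would then go through the same MLL and AIC formulas as the topic-level index.

```python
def word_table(posts, topic, word):
    """posts: list of (topic_number, set_of_words) for one area range and one period.

    Returns (m11, m12, m21, m22) as defined in step (S1).
    """
    m11 = m12 = m21 = m22 = 0
    for post_topic, words in posts:
        has_word = word in words
        if post_topic == topic and has_word:
            m11 += 1          # topic j, contains w
        elif post_topic != topic and has_word:
            m12 += 1          # other topic, contains w
        elif post_topic == topic:
            m21 += 1          # topic j, does not contain w
        else:
            m22 += 1          # other topic, does not contain w
    return m11, m12, m21, m22

# Hypothetical posts from one area range on one day.
posts = [
    (17, {"airport", "delay"}),
    (17, {"airport"}),
    (17, {"rain"}),
    (3, {"airport"}),
    (3, {"lunch"}),
]
table = word_table(posts, topic=17, word="airport")
```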

[Constant topic number extraction unit 118]
Using the frequent word extraction unit 117, the constant topic number extraction unit 118 compares the frequent words of the current period with the frequent words of a predetermined past period, and determines that a region-specific topic number for which at least a predetermined ratio of the same frequent words is extracted is a constant topic number. From a user's point of view, the more sudden a post containing a word w associated with area range C is, the greater the need to read it, and it can be treated as news. By contrast, constant posts can be regarded as having little need to be read.

Whether a word w satisfying w ∈ W(C, t, today) appeared n days ago is expressed as follows.
Exist(day number n days ago, today's day number, C, t, w)
  = 0 (if Info(C, t, w) n days ago is less than a predetermined threshold θ), or
  = 1 (if Info(C, t, w) n days ago is greater than or equal to the predetermined threshold θ)
Using this,
E(n) = Σw {Exist(day number n days ago (= n), today's day number (= 0), C, t, w) × Info(C, t, w)} ÷ Σw {Info(C, t, w)}
0.0 ≦ E(n) ≦ 1.0

E(n) approaches 1.0 when a high proportion of the words w that appear in today's posts on the unevenly distributed topic also appeared in the posts of the same area range and the same topic (topic number) n days ago.
Eval = Σn E(n) / {number of distinct values of n}
Eval is used to determine whether topic t is constant in area range C. Specifically, if the value of Eval is greater than or equal to a predetermined threshold, topic t in area range C is determined to be constant.
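E(n) and Eval can be sketched as below. The sketch assumes that the Info(C, t, w) values in the weighted sums are today's values for today's frequent words, which is how the formula reads but is not spelled out. The function names, the threshold, and all sample Info values are hypothetical.

```python
def e_n(today_info, past_info, theta):
    """Info-weighted share of today's frequent words that were also frequent n days ago.

    today_info: word -> Info(C, t, w) today, for the words in W(C, t, today)
    past_info:  word -> Info(C, t, w) n days ago
    """
    total = sum(today_info.values())
    if total <= 0:
        return 0.0
    kept = sum(value for word, value in today_info.items()
               if past_info.get(word, float("-inf")) >= theta)  # Exist(...) = 1
    return kept / total

def eval_constancy(today_info, history, theta):
    """Average of E(n) over the past days considered (one dict per past day)."""
    if not history:
        return 0.0
    return sum(e_n(today_info, past, theta) for past in history) / len(history)

# Hypothetical Info values for one area range and topic.
today = {"airport": 40.0, "narita": 30.0, "airplane": 20.0, "international": 10.0}
history = [
    {"airport": 35.0, "narita": 28.0, "airplane": 1.0, "international": 12.0},  # 1 day ago
    {"airport": 50.0, "narita": 2.0},                                           # 2 days ago
]
score = eval_constancy(today, history, theta=5.0)
is_constant = score >= 0.5  # 0.5 is an illustrative decision threshold
```

With these numbers, E(1) is 0.8 and E(2) is 0.4, so the average is 0.6 and the topic would be treated as constant under the illustrative threshold.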

[Constant comment deletion unit 119]
The constant comment deletion unit 119 deletes, from the region-specific comments, the comments based on a constant topic number.

For example, suppose that in Narita City on April 8, 2014, the numbers of posts relating to topic number 17 were as follows.
Posts with topic number 17 posted from Narita City: 26
Posts with a topic number other than 17 posted from Narita City: 68
Posts with topic number 17 posted from outside Narita City: 756
Posts with a topic number other than 17 posted from outside Narita City: 28,677
Among these, the following are calculated, for example, for the word “airport”.
m11 = number of posts containing the word “airport” among the posts with topic number 17 in Narita City
m12 = number of posts containing the word “airport” among the posts with a topic number other than 17 in Narita City
m21 = number of posts not containing the word “airport” among the posts with topic number 17 in Narita City
m22 = number of posts not containing the word “airport” among the posts with a topic number other than 17 in Narita City
Suppose that the frequent words “airport”, “Narita”, “airplane”, and “international” are extracted from the posts with topic number 17 in Narita City on April 8, 2014, and that these frequent words w were also extracted as frequent words in the predetermined past period.
In this case, the value of Eval for these words w appearing in the posts with topic number 17 in Narita City is relatively large. As a result, topic number 17 in Narita City is determined to be a constant topic number, and those posts are deleted.
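As a worked check on the four Narita post counts given above, the sketch below plugs them into the topic-level formulas for MLL_IM, MLL_DM, and Info(C, t). Natural logarithms are an assumption; the variable names are hypothetical.

```python
import math

def xlogx(x):
    """Return x * log(x), using the convention 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

# Post counts for Narita City and topic number 17, taken from the example.
n11, n12, n21, n22 = 26, 68, 756, 28677
n = n11 + n12 + n21 + n22

mll_im = (xlogx(n11 + n12) + xlogx(n11 + n21)
          + xlogx(n21 + n22) + xlogx(n12 + n22) - 2 * xlogx(n))
mll_dm = xlogx(n11) + xlogx(n12) + xlogx(n21) + xlogx(n22) - xlogx(n)
info_narita_17 = (-2 * mll_im + 2 * 2) - (-2 * mll_dm + 2 * 3)

# Share of topic 17 among posts from Narita City versus posts from elsewhere.
share_in_narita = n11 / (n11 + n12)
share_elsewhere = n21 / (n21 + n22)
```

Topic 17 accounts for roughly 28% of the posts from Narita City but under 3% of the posts from elsewhere, and Info comes out clearly positive, so the topic is strongly tied to this area range. That alone does not make it newsworthy, which is why the constancy check on frequent words is applied afterwards.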

FIG. 7 is a sequence diagram of the present invention.

FIG. 7 shows the processing of the comment classification server (the comment classification method of the apparatus), which collects a large number of comments in which comments (sentences) with position information and comments (sentences) without position information are mixed, and extracts region-specific comments.
(S111) First, the comment classification server 1 classifies the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies, and assigns a topic number to each comment (see the topic classification unit 111 in FIG. 2 described above).
(S112) Next, the comment classification server 1 classifies the plurality of comments with position information by the predetermined area range to which the position information belongs (see the comment classification unit 112 with position information in FIG. 2 described above).
(S113) Next, for each predetermined area range, the comment classification server 1 extracts the most frequent topic number among the plurality of comments with position information as the region-specific topic number, using a predetermined criterion based on the degree of uneven distribution (see the region-specific topic number extraction unit 113 in FIG. 2 described above).
(S114) The comment classification server 1 then classifies the comments with position information and the comments without position information as region-specific comments, based on the region-specific topic number of each area range (see the region-specific comment classification unit 114 in FIG. 2 described above). Here, a plurality of area ranges may be merged by clustering.
The terminal 4 operated by a viewer then accesses the comment classification server 1 via the access network and the Internet. By transmitting an area range to the comment classification server 1 as a query, the terminal 4 can receive the region-specific comments of that area range from the comment classification server 1.

(S116) The region-specific comments are stored for each predetermined period (see the period-specific comment storage unit 116 in FIG. 2 described above).
(S117) From the region-specific comments based on each region-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or at a predetermined ratio) are extracted, using a predetermined criterion based on the degree of uneven distribution (see the frequent word extraction unit 117 in FIG. 2 described above).
(S118) The frequent words of the current period are compared with the frequent words of a predetermined past period, and a region-specific topic number for which at least a predetermined ratio of the same frequent words is extracted is determined to be a constant topic number (see the constant topic number extraction unit 118 in FIG. 2 described above).
(S119) The comments based on the constant topic number are deleted from the region-specific comments (see the constant comment deletion unit 119 in FIG. 2 described above).

As described above in detail, the comment classification program, server, and method of the present invention make it possible to extract region-specific comments from a large number of comments in which comments with position information and comments without position information are mixed. In particular, according to the present invention, topics that are unevenly distributed in a specific region can be detected on the basis of topic classification, even from comments that do not necessarily contain words identifying a position, such as place names or landmarks.

In addition, according to the present invention, it is also possible to extract only the sudden posts that arise in an area range. That is, posts on topics that are constant in that area range can be deleted. This makes it easier, for example, for posts about a suddenly occurring disaster to reach users.

When a fire broke out in a certain area, the inventors of the present application collected a large number of twitter comments posted during that time period. Of approximately 26,000 comments containing the word “fire”, only 54 were comments with position information. Applying the present invention, LDA topic classification was executed on these approximately 26,000 comments, and comments with the same topic numbers as the 54 comments with position information were associated with that position information. As a result, even comments without position information could be associated with the position of the actual fire site.

Various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention can readily be made by those skilled in the art to the various embodiments described above. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined in the claims and their equivalents.

DESCRIPTION OF SYMBOLS
1 Comment classification server
10 Communication interface unit
101 Comment collection unit
102 Comment accumulation unit
110 Morphological analysis unit
111 Topic classification unit
112 Comment classification unit with position information
113 Region-specific topic number extraction unit
114 Region-specific comment classification unit
115 Area range clustering unit
116 Period-specific comment storage unit
117 Frequent word extraction unit
118 Constant topic number extraction unit
119 Constant comment deletion unit
2 Mini blog server
3 Poster terminal
4 Viewer terminal

Claims

1. A comment classification program that causes a computer to function so as to extract region-specific comments from a large number of comments in which comments (sentences) with position information and comments (sentences) without position information are mixed, the program causing the computer to function as:
topic classification means for classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies, and assigning a topic number to each comment;
comment classification means with position information for classifying a plurality of comments with position information by the predetermined area range to which the position information belongs;
region-specific topic number extraction means for extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as a region-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
region-specific comment classification means for classifying the comments with position information and the comments without position information as region-specific comments based on the region-specific topic number of each area range.

2. The comment classification program according to claim 1, wherein the topic classification means causes the computer to function so as to classify each comment into one of the topic groups using an LDA (Latent Dirichlet Allocation) algorithm that calculates the probability (topic ratio) of belonging to each topic group.

3. The comment classification program according to claim 1 or 2, wherein the computer is caused to function such that the predetermined criterion based on the degree of uneven distribution in the region-specific topic number extraction means is the Akaike information criterion.

4. The comment classification program according to any one of claims 1 to 3, further causing the computer to function as:
period-specific comment storage means for storing the region-specific comments for each predetermined period;
frequent word extraction means for extracting, from the region-specific comments based on each region-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or at a predetermined ratio), using a predetermined criterion based on the degree of uneven distribution;
constant topic number extraction means for comparing the frequent words of the current period with the frequent words of a predetermined past period, and determining, as a constant topic number, a region-specific topic number for which at least a predetermined ratio of the same frequent words is extracted; and
constant comment deletion means for deleting, from the region-specific comments, the comments based on the constant topic number.

5. The comment classification program according to claim 4, wherein the computer is caused to function such that the predetermined criterion based on the degree of uneven distribution in the frequent word extraction means is the Akaike information criterion.

6. The comment classification program according to item 1, wherein the topic classification means analyzes the predicate-argument structure of the comments with position information and the comments without position information, and causes the computer to function so as to remove comments that include a hearsay or conjecture expression.

7. The comment classification program according to any one of claims 1 to 6, further causing the computer to function as area range clustering means for, when the region-specific topic numbers of two adjacent area ranges are the same, clustering the two area ranges and integrating their region-specific comments as those of one area range.

8. The comment classification program according to any one of claims 1 to 7, wherein the computer is caused to function such that the comments are those posted to a mini blog (mini Web log) server by an unspecified number of third parties.

9. A comment classification server that collects a large number of comments in which comments (sentences) with position information and comments (sentences) without position information are mixed, and extracts region-specific comments, the server comprising:
topic classification means for classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies, and assigning a topic number to each comment;
comment classification means with position information for classifying a plurality of comments with position information by the predetermined area range to which the position information belongs;
region-specific topic number extraction means for extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as a region-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
region-specific comment classification means for classifying the comments with position information and the comments without position information as region-specific comments based on the region-specific topic number of each area range.

10. The comment classification server according to claim 9, further comprising:
period-specific comment storage means for storing the region-specific comments for each predetermined period;
frequent word extraction means for extracting, from the region-specific comments based on each region-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or at a predetermined ratio), using a predetermined criterion based on the degree of uneven distribution;
constant topic number extraction means for comparing the frequent words of the current period with the frequent words of a predetermined past period, and determining, as a constant topic number, a region-specific topic number for which at least a predetermined ratio of the same frequent words is extracted; and
constant comment deletion means for deleting, from the region-specific comments, the comments based on the constant topic number.

11. A comment classification method for an apparatus that collects a large number of comments in which comments (sentences) with position information and comments (sentences) without position information are mixed, and extracts region-specific comments, the method comprising:
a first step of classifying the large number of comments into a plurality of topic groups based on the distribution of word appearance frequencies, and assigning a topic number to each comment;
a second step of classifying a plurality of comments with position information by the predetermined area range to which the position information belongs;
a third step of extracting, for each predetermined area range, the most frequent topic number among the plurality of comments with position information as a region-specific topic number, using a predetermined criterion based on the degree of uneven distribution; and
a fourth step of classifying the comments with position information and the comments without position information as region-specific comments based on the region-specific topic number of each area range.

12. The comment classification method according to claim 11, further comprising:
a fifth step of storing the region-specific comments for each predetermined period;
a sixth step of extracting, from the region-specific comments based on each region-specific topic number of each area range, frequent words that appear at least a predetermined number of times (or at a predetermined ratio), using a predetermined criterion based on the degree of uneven distribution;
a seventh step of comparing the frequent words of the current period with the frequent words of a predetermined past period, and determining, as a constant topic number, a region-specific topic number for which at least a predetermined ratio of the same frequent words is extracted; and
an eighth step of deleting, from the region-specific comments, the comments based on the constant topic number.