JP5711689B2

JP5711689B2 - Topic word extraction device, topic word extraction method, and program

Info

Publication number: JP5711689B2
Application number: JP2012070736A
Authority: JP
Inventors: 智愛成
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2012-03-27
Filing date: 2012-03-27
Publication date: 2015-05-07
Anticipated expiration: 2032-03-27
Also published as: JP2013205864A

Description

The present invention relates to a topic word extraction device, a topic word extraction method, and a program for extracting topic words from information sent for communication between users via the Internet.

In recent years, communication between users via the Internet, by e-mail, blogs, SNS, TWITTER (registered trademark), and the like, has increased explosively. Information sent by e-mail, blogs, and similar channels for such communication reflects users' emotional expressions and consumption trends, and also has an advertising effect as word of mouth. Collecting and analyzing this information has therefore become important. In particular, it is increasingly important to extract from it the words that characterize its content and are being talked about (hereinafter referred to as topic words).

For example, a conventional statistical method for extracting keywords from a document is the tf-idf method (see Non-Patent Document 1). The tf-idf method calculates an importance for each word in a target document from the frequency with which the word appears in that document and the number of documents in which the word appears. Words that appear many times in the target document but rarely in other documents receive a high importance, and keywords are extracted based on the calculated importance.
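The tf-idf weighting described above can be sketched as follows. This is a minimal illustration of one common variant (raw term frequency multiplied by the logarithm of the inverse document frequency), not necessarily the exact formula of Non-Patent Document 1. The function name `tf_idf` and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(target_doc, corpus):
    """Score each word of target_doc.

    corpus is a list of tokenized documents and is assumed to contain
    target_doc, so every word has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents in which each word appears.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # Term frequency: number of occurrences in the target document.
    tf = Counter(target_doc)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}


# Hypothetical sample data: "hello" appears in every document.
corpus = [
    ["hello", "camera", "camera", "sale"],
    ["hello", "weather"],
    ["hello", "lunch"],
]
scores = tf_idf(corpus[0], corpus)
```

In this sample, "hello" occurs in every document and therefore scores zero, while "camera", which is frequent in the target document and absent elsewhere, scores highest.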

Gerard Salton, Michael J. McGill, "Introduction to Modern Information Retrieval", McGraw-Hill, 1983.

However, information sent for communication between users via the Internet is close to conversational text, and words that serve to smooth communication are used frequently regardless of the topic. If frequently appearing words are taken as topic words, these communication-smoothing words are included, so that words other than true topic words are also extracted. Topic words extracted from such information with the tf-idf method of Non-Patent Document 1 are therefore of poor accuracy and cannot be used for analysis as they are.

The present invention has been made in view of the above problem, and its object is to provide a topic word extraction device, a topic word extraction method, and a program that extract true topic words from information sent for communication between users via the Internet.

To solve the above problem, the present invention proposes the following. For ease of understanding, reference numerals corresponding to the embodiments of the present invention are attached in the description, but the invention is not limited to them.

(1) The present invention proposes a topic word extraction device that extracts topic words from communication information sent for communication between users via the Internet, the device comprising: communication information acquisition means (for example, corresponding to the communication information acquisition unit 110 in FIG. 1) for acquiring communication information for a predetermined period from one or more communication information storage servers that store the communication information; morphological analysis means (for example, corresponding to the morphological analysis unit 120 in FIG. 1) for performing morphological analysis on text information included in the acquired communication information and extracting words; importance calculation means (for example, corresponding to the importance calculation unit 130 in FIG. 1) for calculating an importance for each word extracted by the morphological analysis means; unique user number acquisition means (for example, corresponding to the unique user number acquisition unit 140 in FIG. 1) for acquiring, with reference to the communication information acquired by the communication information acquisition means and for each word extracted by the morphological analysis means, the number of unique users who sent communication information whose text information includes that word; weighting coefficient calculation means (for example, corresponding to the weighting coefficient calculation unit 150 in FIG. 1) for calculating a weighting coefficient that takes unique users into account, based on the number of unique users acquired by the unique user number acquisition means and the number of pieces of communication information acquired by the communication information acquisition means; and topic word extraction means (for example, corresponding to the topic word extraction unit 160 in FIG. 1) for extracting topic words from the words extracted by the morphological analysis means, based on the importance of each word calculated by the importance calculation means and the weighting coefficient calculated by the weighting coefficient calculation means.

According to this invention, the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers that store communication information. The morphological analysis means performs morphological analysis on text information included in the acquired communication information and extracts words. The importance calculation means calculates an importance for each word extracted by the morphological analysis means. The unique user number acquisition means refers to the communication information acquired by the communication information acquisition means and acquires, for each word extracted by the morphological analysis means, the number of unique users who sent communication information whose text information includes that word. The weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired by the unique user number acquisition means and the number of pieces of communication information acquired by the communication information acquisition means. The topic word extraction means extracts topic words from the words extracted by the morphological analysis means, based on the importance of each word calculated by the importance calculation means and the weighting coefficient calculated by the weighting coefficient calculation means. Therefore, by extracting topic words from the words of the text information included in the communication information based on an importance that takes the number of unique users into account, it is possible to extract the true topic words that many users are talking about in communication between users via the Internet.
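The unique-user weighting described above can be sketched as follows. The text states only that the weighting coefficient is based on the number of unique users and the number of pieces of communication information. The ratio used here, the multiplication of importance by weight, and all names (`unique_user_weights`, `extract_topic_words`, the sample importance values) are assumptions made for illustration.

```python
from collections import defaultdict


def unique_user_weights(messages):
    """messages: list of (user_id, words) pairs.

    Returns {word: unique_users / total_messages}. The ratio form is an
    assumption; the source only names the two quantities involved.
    """
    users_by_word = defaultdict(set)
    for user_id, words in messages:
        for w in set(words):
            users_by_word[w].add(user_id)
    total = len(messages)
    return {w: len(users) / total for w, users in users_by_word.items()}


def extract_topic_words(importance, weights, top_n=1):
    """Rank words by importance * weight (assumed combination rule)."""
    scored = {w: importance[w] * weights.get(w, 0.0) for w in importance}
    return sorted(scored, key=lambda w: (-scored[w], w))[:top_n]


# Hypothetical data: "lunch" is frequent but comes from a single user.
messages = [
    ("alice", ["camera", "sale"]),
    ("bob", ["camera"]),
    ("carol", ["camera", "lunch"]),
    ("carol", ["lunch", "lunch"]),
    ("carol", ["lunch"]),
]
importance = {"camera": 3.0, "sale": 1.0, "lunch": 4.0}  # placeholder values
weights = unique_user_weights(messages)
topics = extract_topic_words(importance, weights)
```

In this sample, "lunch" has the highest raw importance but is written by only one user, so the weighted ranking prefers "camera", which three different users mention.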

(2) The present invention proposes the topic word extraction device of (1), wherein the communication information includes re-sending information indicating whether it re-sends communication information sent by another user; the device comprises re-sent communication information number acquisition means (for example, corresponding to the re-sent communication information number acquisition unit 210 in FIG. 3) for acquiring, with reference to the communication information acquired by the communication information acquisition means, for each word extracted by the morphological analysis means, and based on the re-sending information, a re-sent communication information number indicating the number of pieces of communication information that re-send communication information sent by another user; and the weighting coefficient calculation means calculates a weighting coefficient that takes the degree of propagation of communication information into account, based on the re-sent communication information number acquired by the re-sent communication information number acquisition means and the number of pieces of communication information acquired by the communication information acquisition means.

According to this invention, the communication information includes re-sending information indicating whether it re-sends communication information sent by another user. The re-sent communication information number acquisition means refers to the communication information acquired by the communication information acquisition means and acquires, for each word extracted by the morphological analysis means and based on the re-sending information, a re-sent communication information number indicating the number of pieces of communication information that re-send communication information sent by another user. The weighting coefficient calculation means calculates a weighting coefficient that takes the degree of propagation of communication information into account, based on the re-sent communication information number and the number of pieces of communication information acquired by the communication information acquisition means. Therefore, by extracting topic words based on an importance that takes the degree of propagation of communication information into account, it is possible to extract the true topic words that are propagating between users in communication via the Internet.
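The propagation weighting described above can be sketched as follows. The ratio of re-sent messages to all acquired messages is an assumed form, since the text only names the two quantities. The field names `words` and `is_resend` and the function name `resend_weights` are hypothetical.

```python
from collections import Counter


def resend_weights(messages):
    """messages: list of dicts with 'words' (list of str) and
    'is_resend' (bool, True when the message re-sends another user's).

    Returns {word: resent_messages_containing_word / total_messages}.
    The ratio form is an assumption for illustration.
    """
    resend_count = Counter()
    vocabulary = set()
    for m in messages:
        words = set(m["words"])
        vocabulary |= words
        if m["is_resend"]:
            # Count each re-sent message once per distinct word.
            resend_count.update(words)
    total = len(messages)
    return {w: resend_count[w] / total for w in vocabulary}


# Hypothetical data: two of the four messages are re-sends about "camera".
messages = [
    {"words": ["camera", "sale"], "is_resend": False},
    {"words": ["camera", "sale"], "is_resend": True},
    {"words": ["camera"], "is_resend": True},
    {"words": ["lunch"], "is_resend": False},
]
weights = resend_weights(messages)
```

A word that never appears in a re-sent message, such as "lunch" here, receives a weight of zero under this assumed ratio; a practical system might add a floor so that such words are not eliminated outright.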

(3) The present invention proposes the topic word extraction device of (1) or (2), wherein the communication information storage server manages link relationships between the users; the device comprises link relationship acquisition means (for example, corresponding to the link relationship acquisition unit 220 in FIG. 3) for acquiring the link relationships between users from the communication information storage server, and link user number acquisition means (for example, corresponding to the link user number acquisition unit 230 in FIG. 3) for acquiring, with reference to the communication information acquired by the communication information acquisition means and for each word extracted by the morphological analysis means, the unique users who sent the communication information including that word, and for acquiring, based on the acquired unique users and the link relationships acquired by the link relationship acquisition means, the number of link users, that is, the number of those unique users who are linked to another of those unique users; and the weighting coefficient calculation means calculates a weighting coefficient that takes the link relationships between the users into account, based on the number of link users acquired by the link user number acquisition means and the number of unique users acquired by the unique user number acquisition means.

According to this invention, the communication information storage server manages the link relationships between users. The link relationship acquisition means acquires the link relationships between users from the communication information storage server. The link user number acquisition means refers to the communication information acquired by the communication information acquisition means, acquires, for each word extracted by the morphological analysis means, the unique users who sent communication information including that word, and, based on the acquired unique users and the link relationships acquired by the link relationship acquisition means, acquires the number of link users, that is, the number of those unique users who are linked to another of those unique users. The weighting coefficient calculation means calculates a weighting coefficient that takes the link relationships between users into account, based on the number of link users and the number of unique users acquired by the unique user number acquisition means. Therefore, by extracting topic words based on an importance that takes the link relationships between users into account, it is possible to extract topic words that are being talked about among users who are linked to one another.
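The link-relationship weighting described above can be sketched as follows. Treating links as undirected and using the ratio of link users to unique users are both assumptions, since the text only says the coefficient is based on those two counts. The function name `link_weights` and the sample users are hypothetical.

```python
def link_weights(messages, links):
    """messages: list of (user_id, words); links: iterable of (u, v) pairs.

    For each word, counts the unique users who sent it and how many of
    them are linked to at least one other user who also sent it.
    Returns {word: link_users / unique_users} (assumed ratio form).
    """
    users_by_word = {}
    for user_id, words in messages:
        for w in set(words):
            users_by_word.setdefault(w, set()).add(user_id)

    # Links are treated as undirected here, which is an assumption.
    neighbours = {}
    for u, v in links:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    weights = {}
    for w, users in users_by_word.items():
        linked = sum(
            1 for u in users if neighbours.get(u, set()) & (users - {u})
        )
        weights[w] = linked / len(users)
    return weights


# Hypothetical data: alice and bob are linked and both mention "camera".
messages = [
    ("alice", ["camera"]),
    ("bob", ["camera"]),
    ("carol", ["camera"]),
    ("dave", ["lunch"]),
    ("erin", ["lunch"]),
]
links = [("alice", "bob"), ("dave", "alice")]
weights = link_weights(messages, links)
```

Here dave is linked to alice, but alice did not mention "lunch", so the link does not count toward that word; only links between users who share the word contribute.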

(4) The present invention proposes the topic word extraction device of any of (1) to (3), comprising important word extraction means (for example, corresponding to the important word extraction unit 310 in FIG. 4) for extracting important words by using a dictionary prepared in advance, which stores words that are used to smooth communication between users via the Internet but are not important words representing the characteristics of the content of the text information included in the communication information, and removing the words stored in the dictionary from the words extracted by the morphological analysis means; wherein the importance calculation means calculates an importance for each important word extracted by the important word extraction means, and the topic word extraction means extracts topic words from the important words for which the importance calculation means calculated an importance, based on the importance of each important word calculated by the importance calculation means and the weighting coefficient calculated by the weighting coefficient calculation means.

According to this invention, the important word extraction means extracts important words by using a dictionary prepared in advance, which stores words that are used to smooth communication between users via the Internet but are not important words representing the characteristics of the content of the text information included in the communication information, and removing the words stored in the dictionary from the words extracted by the morphological analysis means. The importance calculation means calculates an importance for each important word extracted by the important word extraction means. The topic word extraction means extracts topic words from the important words for which the importance calculation means calculated an importance, based on the importance of each important word and the weighting coefficient calculated by the weighting coefficient calculation means. Therefore, words that are used to smooth communication between users via the Internet but are not important words representing the characteristics of the content of the communication can be excluded from the topic words.

(5) The present invention proposes the topic word extraction device of (4), wherein the dictionary includes a demonstrative pronoun dictionary that stores demonstrative pronouns (for example, corresponding to the demonstrative pronoun dictionary 321 in FIG. 4), a greeting dictionary that stores words used for greetings (for example, corresponding to the greeting dictionary 322 in FIG. 4), and a time-period word dictionary that stores, for each time period, words related to that time period (for example, corresponding to the time-period word dictionary 323 in FIG. 4).

According to this invention, the dictionary includes a demonstrative pronoun dictionary that stores demonstrative pronouns, a greeting dictionary that stores words used for greetings, and a time-period word dictionary that stores, for each time period, words related to that time period. Therefore, by removing demonstrative pronouns, greetings, and time-period-related words, which are words used to smooth communication between users via the Internet, from the words extracted from the text information included in the communication information, it is possible to extract important words representing the characteristics of the content of the communication.

(6) The present invention proposes the topic word extraction device of (5), wherein the important word extraction means extracts important word candidates by removing the words stored in the demonstrative pronoun dictionary and the greeting dictionary from the words extracted by the morphological analysis means, determines, for each extracted important word candidate, whether the combination of that word and its time period, the time period being identified from the sending date and time of the communication information whose text information includes the word, is stored in the time-period word dictionary, and extracts as important words the words that are not stored in the time-period word dictionary.

According to this invention, the important word extraction means extracts important word candidates by removing the words stored in the demonstrative pronoun dictionary and the greeting dictionary from the words extracted by the morphological analysis means, determines, for each extracted important word candidate, whether the combination of that word and its time period, the time period being identified from the sending date and time of the communication information including the word, is stored in the time-period word dictionary, and extracts as important words the words that are not stored in the time-period word dictionary. Therefore, by removing the words stored in the demonstrative pronoun dictionary and the greeting dictionary before using the time-period word dictionary, the number of words for which a time period must be identified is reduced, and as a result important words can be extracted efficiently.
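The two-stage dictionary filtering described above can be sketched as follows. The dictionary contents, the season and time-of-day boundaries, and the names `periods_of` and `extract_important_words` are all hypothetical; the text specifies only that candidates are first filtered with the demonstrative pronoun and greeting dictionaries and then checked against (time period, word) pairs derived from the sending date and time.

```python
from datetime import datetime

# Hypothetical dictionary contents, for illustration only.
PRONOUNS = {"this", "that", "it"}
GREETINGS = {"hello", "thanks", "bye"}
PERIOD_WORDS = {("winter", "cold"), ("monday", "work"), ("morning", "breakfast")}

SEASONS = ["winter", "spring", "summer", "autumn"]
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def periods_of(sent_at):
    """Map a sending time to (season, weekday, time-of-day) labels.

    The boundaries (December to February as winter, noon and 18:00 as
    time-of-day cut-offs) are assumptions.
    """
    season = SEASONS[(sent_at.month % 12) // 3]
    weekday = WEEKDAYS[sent_at.weekday()]
    if sent_at.hour < 12:
        time_of_day = "morning"
    elif sent_at.hour < 18:
        time_of_day = "afternoon"
    else:
        time_of_day = "evening"
    return (season, weekday, time_of_day)


def extract_important_words(messages):
    """messages: list of (sent_at, words). Returns one word list per message."""
    result = []
    for sent_at, words in messages:
        # Stage 1: drop demonstrative pronouns and greetings.
        candidates = [w for w in words
                      if w not in PRONOUNS and w not in GREETINGS]
        # Stage 2: identify the time period only for the remaining
        # candidates, then drop (period, word) pairs found in the dictionary.
        periods = periods_of(sent_at)
        kept = [w for w in candidates
                if not any((p, w) in PERIOD_WORDS for p in periods)]
        result.append(kept)
    return result


messages = [
    (datetime(2012, 1, 9, 8, 0), ["hello", "cold", "camera", "it"]),
    (datetime(2012, 7, 14, 20, 0), ["cold", "drink"]),
]
important = extract_important_words(messages)
```

In this sample, "cold" is removed from the January message because it pairs with winter in the dictionary, but it survives in the July message, where it is not a seasonal filler and may carry real content.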

(7) The present invention proposes the topic word extraction device of (5) or (6), wherein the words used for greetings also include words used in small talk exchanged before getting to the main subject of a conversation or at the end of the conversation.

According to this invention, the words used for greetings also include words used in small talk exchanged before getting to the main subject of a conversation or at the end of the conversation. The greeting dictionary thus covers greetings in a broad sense: words used in small talk before the main subject, such as information about each other, the weather, or other unrelated chat, and in small talk exchanged when parting at the end of a conversation. By removing these words, which are used to smooth communication between users via the Internet but are not important words, it is possible to extract important words representing the characteristics of the content of the communication.

(8) The present invention proposes the topic word extraction device of any of (5) to (7), wherein the time period is a season, a day of the week, or a time of day.

According to this invention, the time period is a season, a day of the week, and a time of day. Therefore, words related to the season, the day of the week, and the time of day, which are often used to smooth communication between users via the Internet, can be removed, and important words representing the characteristics of the content of the communication can be extracted.

(8) The present invention proposes a topic word extraction method in a topic word extraction device that comprises communication information acquisition means, morphological analysis means, importance calculation means, unique user number acquisition means, weighting coefficient calculation means, and topic word extraction means, and that extracts topic words from communication information sent for communication between users via the Internet, the method comprising: a first step (for example, step S1 in FIG. 2) in which the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers that store the communication information; a second step (for example, step S2 in FIG. 2) in which the morphological analysis means performs morphological analysis on text information included in the communication information acquired in the first step and extracts words; a third step (for example, step S3 in FIG. 1) in which the importance calculation means calculates an importance for each word extracted by the morphological analysis means; a fourth step (for example, step S4 in FIG. 2) in which the unique user number acquisition means, with reference to the communication information acquired in the first step, acquires, for each word extracted in the third step, the number of unique users who sent communication information whose text information includes that word; a fifth step (for example, step S5 in FIG. 2) in which the weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired in the fourth step and the number of pieces of communication information acquired in the first step; and a sixth step (for example, step S6 in FIG. 2) in which the topic word extraction means extracts topic words from the words extracted in the third step, based on the importance of each word calculated in the third step and the weighting coefficient calculated in the fifth step.

According to this invention, first, in the first step, the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers that store communication information. Next, in the second step, the morphological analysis means performs morphological analysis on the text information included in the communication information acquired in the first step and extracts words. Next, in the third step, the importance calculation means calculates an importance for each word extracted by the morphological analysis means. Next, in the fourth step, the unique user number acquisition means, with reference to the communication information acquired in the first step, acquires, for each word extracted in the third step, the number of unique users who sent communication information whose text information includes that word. Next, in the fifth step, the weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired in the fourth step and the number of pieces of communication information acquired in the first step. Next, in the sixth step, the topic word extraction means extracts topic words from the words extracted in the third step, based on the importance of each word calculated in the third step and the weighting coefficient calculated in the fifth step. Therefore, by extracting topic words from the words of the text information included in the communication information based on an importance that takes unique users into account, it is possible to extract the true topic words that many users are talking about in communication between users via the Internet.

(9) The present invention proposes a program for causing a computer to execute a topic word extraction method in a topic word extraction device that comprises communication information acquisition means, morphological analysis means, importance calculation means, unique user number acquisition means, weighting coefficient calculation means, and topic word extraction means, and that extracts topic words from communication information sent for communication between users via the Internet, the program causing the computer to execute: a first step (for example, step S1 in FIG. 2) in which the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers that store the communication information; a second step (for example, step S2 in FIG. 2) in which the morphological analysis means performs morphological analysis on text information included in the communication information acquired in the first step and extracts words; a third step (for example, step S3 in FIG. 1) in which the importance calculation means calculates an importance for each word extracted by the morphological analysis means; a fourth step (for example, step S4 in FIG. 2) in which the unique user number acquisition means, with reference to the communication information acquired in the first step, acquires, for each word extracted in the third step, the number of unique users who sent communication information whose text information includes that word; a fifth step (for example, step S5 in FIG. 2) in which the weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired in the fourth step and the number of pieces of communication information acquired in the first step; and a sixth step (for example, step S6 in FIG. 2) in which the topic word extraction means extracts topic words from the words extracted in the third step, based on the importance of each word calculated in the third step and the weighting coefficient calculated in the fifth step.

According to the present invention, topic words are extracted from the words of the text information included in communication information transmitted for communication between users via the Internet, based on an importance that takes into account at least one of the number of unique users, the degree of information propagation, and the link relationships between users. This makes it possible to extract the true topic words that many users are talking about in communication between users via the Internet.

FIG. 1 is a diagram showing the functional configuration of the topic word extraction device according to the first embodiment of the present invention.
FIG. 2 is a flowchart of the topic word extraction process according to the first embodiment of the present invention.
FIG. 3 is a diagram showing the functional configuration of the topic word extraction device according to the second embodiment of the present invention.
FIG. 4 is a diagram showing the functional configuration of the topic word extraction device according to the third embodiment of the present invention.
FIG. 5 is a flowchart showing an example of the important word extraction process according to the third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The components in the embodiments can be replaced with existing components as appropriate, and various variations, including combinations with other existing components, are possible. The description of the embodiments therefore does not limit the content of the invention described in the claims.

<First Embodiment>
A first embodiment of the present invention will be described with reference to FIGS. 1 and 2.

<Functional configuration of topic word extraction device>
FIG. 1 is a diagram showing the functional configuration of a topic word extraction device 100 according to the first embodiment of the present invention. The topic word extraction device 100 is connected to a communication information storage server 10 via a communication network.

The communication information storage server 10 stores and manages communication information transmitted for communication between users via the Internet. Here, the communication information is, for example, e-mail, posts to social networking services such as blogs, SNS, and TWITTER, or instant messages. The communication information of the present embodiment includes text information indicating the content of the communication, the date and time at which the communication information was transmitted (hereinafter, the transmission date and time), and information that uniquely identifies the user who transmitted the communication information (for example, a user ID or user name).

As shown in FIG. 1, the topic word extraction device 100 includes a communication information acquisition unit 110, a morphological analysis unit 120, an importance calculation unit 130, a unique user number acquisition unit 140, a weighting coefficient calculation unit 150, and a topic word extraction unit 160.

The communication information acquisition unit 110 acquires communication information for a predetermined period from the communication information storage server 10. For example, it acquires the communication information by using an API offered by the operator that provides the communication information storage server 10. The predetermined period can be set freely by the analyst who analyzes the communication information, for example to the most recent month or to the 12 months of the previous year.

The morphological analysis unit 120 performs morphological analysis on the communication information acquired by the communication information acquisition unit 110 and extracts words. The morphological analysis unit 120 may extract, from the morphologically analyzed words, only words of specific parts of speech set in advance.

The importance calculation unit 130 calculates the importance of each word extracted by the morphological analysis unit 120. One way to calculate the importance is to compute tf-idf (Term Frequency - Inverse Document Frequency) and use the resulting score as the importance.
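
The source names tf-idf but does not give a formula or say how per-document scores are aggregated per word. The following is a minimal sketch under assumptions made here: each item of communication information is treated as one document, the term frequency is the total count of the word over all items, and the inverse document frequency is log(N / df). The function name `tfidf_importance` is hypothetical.

```python
import math
from collections import Counter

def tfidf_importance(docs):
    """docs: list of token lists, one per item of communication information.
    Returns {word: total term count * log(N / document frequency)}."""
    n_docs = len(docs)
    tf = Counter()  # total occurrences of each word across all items
    df = Counter()  # number of items that contain each word
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

docs = [["earthquake", "tokyo"], ["earthquake", "news"], ["lunch"]]
scores = tfidf_importance(docs)
```

Other tf-idf variants (smoothed idf, length-normalized tf) would fit the description equally well; the weighting described in the following paragraphs does not depend on which one is used.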

For each word extracted by the morphological analysis unit 120, the unique user number acquisition unit 140 first extracts, from the communication information acquired by the communication information acquisition unit 110, the items of communication information whose text information includes that word. Then, for each word, the unique user number acquisition unit 140 counts and acquires the number of unique users based on the user information of the extracted communication information. Here, the number of unique users is the number of distinct users who transmitted communication information whose text information includes the word. However many times the same user transmits communication information including the same word, that user is counted only once, so the value is the net number of users who used the word.
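
The counting rule above can be sketched as follows. This is an illustration only: the `(user_id, tokens)` input shape and the function name `unique_user_counts` are assumptions, not part of the source.

```python
from collections import defaultdict

def unique_user_counts(items):
    """items: iterable of (user_id, tokens) pairs, one per item of
    communication information. Returns {word: number of distinct users}."""
    users_by_word = defaultdict(set)
    for user_id, tokens in items:
        for word in set(tokens):
            # A set keeps each user once per word, however often they post it.
            users_by_word[word].add(user_id)
    return {word: len(users) for word, users in users_by_word.items()}

items = [
    ("u1", ["sale", "sale"]),
    ("u1", ["sale"]),
    ("u2", ["sale", "rain"]),
]
counts = unique_user_counts(items)
```

Here user `u1` posts "sale" in two items but contributes only one to the count for that word.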

For each word extracted by the morphological analysis unit 120, the weighting coefficient calculation unit 150 first counts and acquires, from the communication information acquired by the communication information acquisition unit 110, the number of items of communication information that include the word (hereinafter, the communication information count). Next, for each word, the weighting coefficient calculation unit 150 calculates the weighting coefficient α as the quotient obtained by dividing the number of unique users acquired by the unique user number acquisition unit 140 by the communication information count, as shown in Equation 1. The weighting coefficient α is small when the number of unique users is small, that is, when the word is used frequently by a particular user. Conversely, α is large when the number of unique users is large, that is, when no particular user dominates and the word is used by many unique users.
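
Equation 1 is not reproduced in this text. From the description it can be written as below, where the symbols U(w) and N(w) are notation introduced here for illustration and do not come from the source.

```latex
\alpha(w) = \frac{U(w)}{N(w)}
% U(w): number of unique users who transmitted communication information containing word w
% N(w): number of items of communication information containing word w
```

For example, a word that appears in 10 items all sent by one user has α = 1/10 = 0.1, whereas a word that appears in 10 items sent by 10 different users has α = 10/10 = 1.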

The topic word extraction unit 160 multiplies the importance calculated by the importance calculation unit 130 by the weighting coefficient α calculated by the weighting coefficient calculation unit 150, and extracts topic words from the words extracted by the morphological analysis unit 120 based on the weighted importance. For example, the topic word extraction unit 160 extracts as topic words those words whose weighted importance is equal to or greater than a preset threshold. Alternatively, the topic word extraction unit 160 may sort the words extracted by the morphological analysis unit 120 in descending order of weighted importance and take a predetermined number of words from the top as topic words.
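
Both selection rules (threshold and top-N) can be sketched in a few lines. The function name `extract_topic_words`, its parameters, and the sample values are hypothetical.

```python
def extract_topic_words(importance, alpha, threshold=None, top_n=None):
    """importance, alpha: {word: float}. Returns words ranked by
    importance * alpha, optionally filtered by threshold and cut to top_n."""
    weighted = {w: importance[w] * alpha.get(w, 0.0) for w in importance}
    ranked = sorted(weighted, key=weighted.get, reverse=True)
    if threshold is not None:
        ranked = [w for w in ranked if weighted[w] >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked

importance = {"sale": 4.0, "rain": 2.0, "bot": 6.0}
alpha = {"sale": 1.0, "rain": 0.5, "bot": 0.1}
by_threshold = extract_topic_words(importance, alpha, threshold=1.0)
top_one = extract_topic_words(importance, alpha, top_n=1)
```

In this sample, "bot" has the highest raw importance but a low α, so it falls below the threshold after weighting.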

By using, as the importance for topic word extraction, the value obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes unique users into account, the importance of words used frequently by a particular user can be lowered, while the importance of words used by many unique users can be raised. As a result, the true topic words used by many users can be extracted in communication between users via the Internet.

<Topic word extraction processing flow>
FIG. 2 is a diagram showing the topic word extraction processing flow according to the first embodiment of the present invention.

First, in step S1, the communication information acquisition unit 110 acquires communication information from the communication information storage server 10.

Next, in step S2, the morphological analysis unit 120 performs morphological analysis on the text information included in the communication information acquired in step S1 and extracts words.

Next, in step S3, the importance calculation unit 130 calculates the importance of each word extracted in step S2.

Next, in step S4, the unique user number acquisition unit 140 counts and acquires, for each word extracted in step S2, the number of unique users who transmitted communication information whose text information includes the word.

Next, in step S5, the weighting coefficient calculation unit 150 calculates, as the weighting coefficient α that takes unique users into account, the quotient obtained by dividing the number of unique users acquired in step S4 by the number of items of communication information acquired in step S1.

Next, in step S6, the topic word extraction unit 160 multiplies the importance calculated in step S3 by the weighting coefficient α calculated in step S5, and extracts topic words from the words extracted in step S2 based on the weighted importance.

As described above, according to the present embodiment, topic words are extracted from the words included in the text information of communication information transmitted in communication between users via the Internet, based on the importance obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes the number of unique users into account. This makes it possible to extract the true topic words that many users are talking about.

<Second Embodiment>
A second embodiment of the present invention will be described with reference to FIG. 3. The topic word extraction device of this embodiment extracts topic words by also taking into account the degree of propagation of a word and the link relationships among the users who use the word. Components given the same reference numerals as in the first embodiment have the same functions, so their detailed description is omitted.

FIG. 3 is a diagram showing the functional configuration of a topic word extraction device 200 according to the second embodiment of the present invention. The topic word extraction device 200 is connected to the communication information storage server 10 via a communication network.

In the present embodiment, the communication information stored and managed by the communication information storage server 10 includes retransmission information in addition to the date and time at which the communication information was transmitted (hereinafter, the transmission date and time) and information that uniquely identifies the user who transmitted it (for example, a user ID or user name). Here, the retransmission information indicates whether the text information of an item of communication information contains all or part of the text information of another user's communication information; for example, it indicates whether a post is a retweet on TWITTER.

As shown in FIG. 3, the topic word extraction device 200 includes the communication information acquisition unit 110, the morphological analysis unit 120, the importance calculation unit 130, the unique user number acquisition unit 140, a retransmitted communication information count acquisition unit 210, a link relationship acquisition unit 220, a link user number acquisition unit 230, a weighting coefficient calculation unit 151, and a topic word extraction unit 161.

Based on the retransmission information included in the communication information acquired by the communication information acquisition unit 110, the retransmitted communication information count acquisition unit 210 acquires the number of items of communication information that retransmit communication information originally transmitted by another user (hereinafter, the retransmitted communication information count).

The link relationship acquisition unit 220 acquires the link relationships between users from the communication information storage server 10, for example the information that user B has established a link to user A. The number of levels of link relationships to acquire is arbitrary, but up to the second or third level is desirable.

The link user number acquisition unit 230 first extracts, for each word, the unique users who transmitted communication information including the word, based on the user information of the extracted communication information. Next, for each word, the link user number acquisition unit 230 divides the extracted unique users into link users and linked users based on the link relationships acquired by the link relationship acquisition unit 220. Here, a link user is a user who has established a link to another user, for example a follower on TWITTER. A linked user is a user to whom another user has established a link, for example a person who is being followed on TWITTER. The link user number acquisition unit 230 then counts and acquires, for each word, the number of link users (hereinafter, the link user count).
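
The source does not say whether the other end of a link must also have used the word. The sketch below takes one possible reading, stated here as an assumption: a user counts as a link user for a word only when they follow another user who also used that word. The function name `count_link_users` and the `(follower, followee)` pair format are hypothetical.

```python
def count_link_users(word_users, follows):
    """word_users: set of user IDs that used the word.
    follows: iterable of (follower, followee) pairs.
    Returns the number of word users who follow another word user
    (assumed reading of "link user")."""
    link_users = {
        follower
        for follower, followee in follows
        if follower in word_users and followee in word_users and follower != followee
    }
    return len(link_users)

follows = [("u2", "u1"), ("u3", "u1"), ("u4", "u9")]
n_link = count_link_users({"u1", "u2", "u3", "u4"}, follows)
```

In this sample `u4` follows `u9`, who did not use the word, so `u4` is not counted under the assumed reading.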

For each word extracted by the morphological analysis unit 120, the weighting coefficient calculation unit 151 first searches the communication information acquired by the communication information acquisition unit 110 for the items that include the word and acquires the communication information count for the word. Next, for each word, the weighting coefficient calculation unit 151 calculates the weighting coefficient β as the quotient obtained by dividing the retransmitted communication information count acquired by the retransmitted communication information count acquisition unit 210 by the communication information count, as shown in Equation 2. The weighting coefficient β is small when the retransmitted communication information count is small, that is, when the communication information containing the word has not propagated much. Conversely, β is large when the retransmitted communication information count is large, that is, when the communication information containing the word has propagated widely.
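
Equation 2 is not reproduced in this text. From the description it can be written as below. The symbols R(w) and N(w) are notation introduced here, and counting R(w) only among the items that contain the word w is an assumption made so that β is defined per word.

```latex
\beta(w) = \frac{R(w)}{N(w)}
% R(w): number of retransmitted items of communication information containing word w (assumed per-word count)
% N(w): number of items of communication information containing word w
```

For example, if a word appears in 20 items and 5 of them are retransmissions, β = 5/20 = 0.25.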

In addition, for each word extracted by the morphological analysis unit 120, the weighting coefficient calculation unit 151 calculates the weighting coefficient γ as the quotient obtained by dividing the link user count acquired by the link user number acquisition unit 230 by the number of unique users acquired by the unique user number acquisition unit 140, as shown in Equation 3. The weighting coefficient γ is large when the link user count is large, that is, when the word is used frequently among users who are in a link relationship. Conversely, γ is small when the link user count is small, that is, when the word is used mostly by users who are not in a link relationship.
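
Equation 3 is likewise not reproduced in this text. From the description it can be written as below, with L(w) and U(w) as notation introduced here for illustration.

```latex
\gamma(w) = \frac{L(w)}{U(w)}
% L(w): number of link users among the unique users of word w
% U(w): number of unique users who transmitted communication information containing word w
```

For example, if 8 unique users used a word and 6 of them are link users, γ = 6/8 = 0.75.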

The topic word extraction unit 161 multiplies the importance calculated by the importance calculation unit 130 by the weighting coefficients α and β calculated by the weighting coefficient calculation unit 151, and extracts topic words from the words extracted by the morphological analysis unit 120 based on the weighted importance. For example, the topic word extraction unit 161 extracts as topic words those words whose importance multiplied by α and β is equal to or greater than a preset threshold. Alternatively, the topic word extraction unit 161 may sort the words extracted by the morphological analysis unit 120 in descending order of the importance multiplied by α and β and take a predetermined number of words from the top as topic words.

By using, as the importance for topic word extraction, the value obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes unique users into account and by the weighting coefficient β that takes the degree of word propagation into account, the importance of words that have not propagated among users can be lowered, while the importance of words that have propagated widely among users can be raised. As a result, true topic words that are used by many users and have propagated widely can be extracted in communication between users via the Internet.

The topic word extraction unit 161 also multiplies the importance calculated by the importance calculation unit 130 by the weighting coefficients α and γ calculated by the weighting coefficient calculation unit 151, and extracts topic words from the words extracted by the morphological analysis unit 120 based on the weighted importance. For example, the topic word extraction unit 161 extracts as topic words those words whose importance multiplied by α and γ is equal to or greater than a preset threshold. Alternatively, the topic word extraction unit 161 may sort the words extracted by the morphological analysis unit 120 in descending order of the importance multiplied by α and γ and take a predetermined number of words from the top as topic words.

By using, as the importance for topic word extraction, the value obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes unique users into account and by the weighting coefficient γ that takes the link relationships between users into account, the importance of words used mostly by users who are not in a link relationship can be lowered, while the importance of words used frequently among users who are in a link relationship can be raised. As a result, words that are used by many users and are used frequently among linked users, that is, words that are currently a topic among some users and may go on to propagate among users and become a wider topic, can be extracted as topic words.

The topic word extraction unit 161 may also use, for topic word extraction, the importance multiplied by both the weighting coefficient β and the weighting coefficient γ in addition to the weighting coefficient α.
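
The combinations described above (α alone, α with β, α with γ, or all three) can be sketched as a single scoring function. The function name `combined_score`, its parameter names, and the sample counts are hypothetical, and the counts are assumed to be per word with `n_items` and `n_unique_users` greater than zero.

```python
def combined_score(tfidf, n_items, n_unique_users,
                   n_retransmitted=None, n_link_users=None):
    """Weighted importance of one word.
    alpha is always applied; beta and gamma only when their counts are given."""
    alpha = n_unique_users / n_items            # unique users / items containing the word
    score = tfidf * alpha
    if n_retransmitted is not None:
        score *= n_retransmitted / n_items      # beta: share of retransmitted items
    if n_link_users is not None:
        score *= n_link_users / n_unique_users  # gamma: share of link users
    return score

score_all = combined_score(8.0, n_items=10, n_unique_users=5,
                           n_retransmitted=4, n_link_users=2)
score_alpha_only = combined_score(8.0, n_items=10, n_unique_users=5)
```

With these sample counts α = 0.5, β = 0.4, and γ = 0.4, so the raw importance 8.0 is scaled to about 0.64 when all three coefficients are applied and to 4.0 with α alone.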

As described above, according to the present embodiment, topic words are extracted from the words included in the text information of communication information transmitted in communication between users via the Internet, based on the importance obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes unique users into account and by the weighting coefficient β that takes the degree of propagation of the posted information into account. This makes it possible to extract the true topic words that are propagating among users.

Furthermore, by extracting topic words based on the importance obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α that takes unique users into account and by the weighting coefficient γ that takes the link relationships between users into account, words that are currently a topic among some users and may go on to propagate among users and become a wider topic can be extracted as topic words.

<Third Embodiment>
A third embodiment of the present invention will be described with reference to FIGS. 4 and 5.

<Functional configuration of topic word extraction device>
FIG. 4 is a diagram showing the functional configuration of a topic word extraction device 300 according to the third embodiment of the present invention. The topic word extraction device 300 is connected to the communication information storage server 10 via a communication network.

As shown in FIG. 4, the topic word extraction device 300 includes the communication information acquisition unit 110, the morphological analysis unit 120, an importance calculation unit 132, a unique user number acquisition unit 142, a weighting coefficient calculation unit 152, a topic word extraction unit 162, an important word extraction unit 310, and a dictionary storage unit 320.

The important word extraction unit 310 extracts important words by removing, from the words morphologically analyzed by the morphological analysis unit 120, the words stored in the dictionaries prepared in advance in the dictionary storage unit 320. The important word extraction process is described later.

The dictionary storage unit 320 stores dictionaries prepared in advance. The dictionaries contain words that are used to make communication between users via the Internet go smoothly but that are not important words. Here, an important word is a word that characterizes the content of the communication.

In the present embodiment, the dictionary storage unit 320 stores a demonstrative pronoun dictionary 321, a greeting dictionary 322, and a time-period word dictionary 323. These respectively hold demonstrative pronouns, words used in greetings, and time-period-related words, all of which are used to make communication between users via the Internet go smoothly but do not become important words in the communication. Dictionaries can be added to and deleted from the dictionary storage unit 320.

The demonstrative pronoun dictionary 321 stores demonstrative pronouns. For example, when the text information is in Japanese, it stores words such as 彼 (he), 彼女 (she), これ (this), and それ (that).

The greeting dictionary 322 stores words used in greetings, for example こんにちわ (hello), さようなら (goodbye), and ありがとう (thank you).

The words used in greetings may include words used in the small talk exchanged before getting to the main topic of a conversation or at its end. Because communication information is transmitted for communication between users via the Internet, its text information is close to conversational speech. The small talk common in conversation, such as remarks about each other or the weather before the main topic, unrelated chat before and after the main topic, and remarks exchanged when parting at the end of the conversation, therefore often appears in the posted information as well. Like "hello" and "goodbye", the words in such small talk are used to make communication between users via the Internet go smoothly and do not become important words in the communication information. Including these words in the greeting dictionary 322 therefore allows words used only to smooth communication to be excluded from the important words.

The time-dependent word dictionary 323 stores, for each time period, words related to that time period. Here, a time period is a time of day, a day of the week, a season, or the like. For example, the time-dependent word dictionary 323 stores words related to the morning, such as 朝 (morning) and 朝食 (breakfast), in association with the time of day "morning", and stores words related to Sunday, such as 日曜 (Sunday) and 休日 (holiday), in association with the day of the week "Sunday".
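As a rough illustration of how the three dictionaries described above might be laid out in memory, the following Python sketch uses plain sets and a mapping keyed by time period. The variable names, the key strings, the `time_periods` helper, and the hour range treated as "morning" are all assumptions made for this example; the specification does not define them.

```python
from datetime import datetime
from typing import Dict, Set

# Hypothetical contents, taken from the examples given in the description.
DEMONSTRATIVE_PRONOUNS: Set[str] = {"彼", "彼女", "これ", "それ"}
GREETINGS: Set[str] = {"こんにちわ", "さようなら", "ありがとう"}

# Time-dependent word dictionary: time period -> words related to that period.
TIME_DEPENDENT_WORDS: Dict[str, Set[str]] = {
    "morning": {"朝", "朝食"},
    "sunday": {"日曜", "休日"},
}


def time_periods(posted_at: datetime) -> Set[str]:
    """Return the time-period keys that apply to a posting timestamp."""
    periods: Set[str] = set()
    if 5 <= posted_at.hour < 11:      # assumed boundaries for "morning"
        periods.add("morning")
    if posted_at.weekday() == 6:      # datetime.weekday(): Monday is 0, Sunday is 6
        periods.add("sunday")
    return periods
```

Keying the time-dependent dictionary by period keeps the later lookup to a set-membership test per applicable period.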

An example of the important word extraction processing performed by the important word extraction unit 310, using the demonstrative pronoun dictionary 321, the greeting dictionary 322, and the time-dependent word dictionary 323 stored in the dictionary storage unit 320, will now be described with reference to FIG. 5.

First, in step S11, the important word extraction unit 310 removes the words stored in the demonstrative pronoun dictionary 321 from the words obtained by the morphological analysis unit 120.

Next, in step S12, the important word extraction unit 310 removes the words stored in the greeting dictionary 322 from the words remaining after step S11. The words remaining after step S12 are the important word candidates. Steps S11 and S12 may be performed in the reverse order.

Next, in step S13, the important word extraction unit 310 takes one word from the important word candidates. The word taken is removed from the important word candidates.

Next, in step S14, the important word extraction unit 310 identifies the time period of the word taken in step S13, based on the posting date and time of the posted information that contains the word.

Next, in step S15, the important word extraction unit 310 determines whether the word taken in step S13 is stored in the time-dependent word dictionary 323 in association with the time period identified in step S14. If it is stored (YES), the word is not extracted as an important word and the process proceeds to step S17. If it is not stored (NO), the process proceeds to step S16.

In step S16, the important word extraction unit 310 adds the word taken in step S13 to the important words.

In step S17, the important word extraction unit 310 determines whether any important word candidates remain. If candidates remain (YES), the process returns to step S13; if none remain (NO), the process ends.

In this way, removing the words stored in the demonstrative pronoun dictionary and the greeting dictionary before using the time-dependent word dictionary reduces the number of words whose time period must be identified, so the important words can be extracted efficiently.
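The flow of steps S11 to S17 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the important word extraction unit 310: the function names, the `(word, posted_at)` input format, and the `period_of` helper with its "morning" boundaries are assumptions introduced for the example.

```python
from datetime import datetime


def period_of(posted_at):
    """Hypothetical helper for S14: map a timestamp to time-period keys."""
    periods = set()
    if 5 <= posted_at.hour < 11:    # assumed boundaries for "morning"
        periods.add("morning")
    if posted_at.weekday() == 6:    # Monday is 0, so Sunday is 6
        periods.add("sunday")
    return periods


def extract_important_words(words_with_time, pronouns, greetings, time_words):
    """words_with_time: list of (word, posted_at) pairs from morphological analysis."""
    # S11 and S12: remove demonstrative pronouns and greeting words first.
    candidates = [(w, t) for (w, t) in words_with_time
                  if w not in pronouns and w not in greetings]
    important = []
    while candidates:                          # S17: repeat while candidates remain
        word, posted_at = candidates.pop(0)    # S13: take one word, drop it from candidates
        periods = period_of(posted_at)         # S14: identify the time period
        # S15: is the word registered for any identified time period?
        if any(word in time_words.get(p, set()) for p in periods):
            continue                           # YES: not an important word
        important.append(word)                 # S16: NO, keep as an important word
    return important


sunday_morning = datetime(2012, 3, 25, 8, 0)
tuesday_evening = datetime(2012, 3, 27, 20, 0)
result = extract_important_words(
    [("これ", sunday_morning), ("こんにちわ", sunday_morning),
     ("朝食", sunday_morning), ("朝食", tuesday_evening), ("新製品", sunday_morning)],
    pronouns={"これ", "それ"},
    greetings={"こんにちわ", "さようなら"},
    time_words={"morning": {"朝", "朝食"}, "sunday": {"日曜", "休日"}},
)
```

In this example 朝食 (breakfast) is dropped when posted in the morning but kept when posted in the evening, which is the behavior the time-dependent check in step S15 is meant to produce.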

The unique user number acquisition unit 140 first extracts, for each important word extracted by the important word extraction unit 310, the communication information containing that important word from the communication information acquired by the communication information acquisition unit 110. It then counts and acquires the number of unique users for each important word, based on the user information of the extracted communication information. Here, the number of unique users is the number of distinct users who transmitted communication information containing the important word. However many times the same user transmits communication information containing the same important word, those transmissions are counted together as a single transmission, giving the net number of users who used the important word.
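A minimal Python sketch of this counting is shown below. The representation of each piece of communication information as a dictionary with `user` and `words` keys is an assumption made for illustration; the specification only states that user information accompanies the communication information.

```python
from collections import defaultdict


def count_unique_users(posts, important_words):
    """Return {important word: number of distinct users who posted it}."""
    users_by_word = defaultdict(set)
    for post in posts:
        for word in important_words:
            if word in post["words"]:
                # A set keeps each user once, however many times they post the word.
                users_by_word[word].add(post["user"])
    return {word: len(users_by_word[word]) for word in important_words}


posts = [
    {"user": "u1", "words": ["新製品", "発売"]},
    {"user": "u1", "words": ["新製品"]},
    {"user": "u2", "words": ["新製品"]},
    {"user": "u3", "words": ["発売"]},
]
unique_counts = count_unique_users(posts, ["新製品", "発売", "地震"])
```

Here 新製品 appears in three posts but from only two users, so its unique user count is 2.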

The weighting coefficient calculation unit 152 first counts and acquires, for each important word extracted by the important word extraction unit 310, the number of pieces of communication information containing that important word among the communication information acquired by the communication information acquisition unit 110. Next, for each important word, the weighting coefficient calculation unit 152 calculates the weighting coefficient α as the quotient obtained by dividing the number of unique users acquired by the unique user number acquisition unit 142 by the acquired number of pieces of communication information, as in Equation 1 of the first embodiment. The weighting coefficient α is small for an important word with few unique users, that is, one used frequently by particular users. It is large for an important word with many unique users, that is, one used by many unique users rather than frequently by particular users.
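Based only on the verbal description above (Equation 1 itself appears in the first embodiment and is not reproduced here), the coefficient can be sketched as α = (number of unique users) / (number of pieces of communication information containing the word). In the Python sketch below, the post format and the choice to return 0.0 when no post contains the word are assumptions for illustration.

```python
def weighting_coefficient(posts, word):
    """alpha = unique users / number of posts containing the word."""
    containing = [p for p in posts if word in p["words"]]
    if not containing:
        return 0.0  # assumption: the description does not cover this case
    unique_users = {p["user"] for p in containing}
    return len(unique_users) / len(containing)


# One user repeating a word four times versus four users posting it once each.
spam_like = [{"user": "u1", "words": ["セール"]} for _ in range(4)]
widely_used = [{"user": f"u{i}", "words": ["地震"]} for i in range(4)]

alpha_spam = weighting_coefficient(spam_like, "セール")
alpha_wide = weighting_coefficient(widely_used, "地震")
```

Under this definition α lies between 0 and 1: it reaches 1 when every post containing the word comes from a different user and approaches 0 as a single user accounts for more of the posts.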

The topic word extraction unit 162 multiplies the importance calculated by the importance calculation unit 130 by the weighting coefficient α calculated by the weighting coefficient calculation unit 152, and extracts topic words from the important words extracted by the important word extraction unit 310 based on the importance multiplied by α. For example, the topic word extraction unit 160 extracts as topic words the important words whose importance multiplied by α is equal to or greater than a preset threshold. Alternatively, the topic word extraction unit 160 may sort the important words extracted by the important word extraction unit 310 in descending order of importance multiplied by α and take a predetermined number of important words from the top as the topic words.
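The two selection rules described above, a threshold on the weighted importance and a top-N ranking, can be sketched in Python as follows. The function names and the dictionary inputs are hypothetical, and the importance values in the example are made up rather than computed by tf-idf.

```python
def topic_words_by_threshold(importance, alpha, threshold):
    """Keep important words whose alpha-weighted importance meets the threshold."""
    return [w for w in importance if importance[w] * alpha[w] >= threshold]


def topic_words_top_n(importance, alpha, n):
    """Rank important words by alpha-weighted importance and keep the top n."""
    ranked = sorted(importance, key=lambda w: importance[w] * alpha[w], reverse=True)
    return ranked[:n]


# Illustrative values only: weighted scores are a=1.0, b=2.0, c=1.5.
importance = {"a": 4.0, "b": 2.0, "c": 3.0}
alpha = {"a": 0.25, "b": 1.0, "c": 0.5}

above_threshold = topic_words_by_threshold(importance, alpha, 1.5)
top_two = topic_words_top_n(importance, alpha, 2)
```

Word "a" has the highest raw importance but the lowest weighted score, because most of its occurrences come from few users; both rules therefore select "b" and "c" instead.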

As described above, according to the present embodiment, words that are used to facilitate communication between users via the Internet but are not important words characterizing the content of the communication are removed. Topic words are then extracted from the remaining important words based on an importance obtained by multiplying the importance calculated by tf-idf or the like by the weighting coefficient α, which takes unique users into account. This makes it possible to extract the true topic words that many social media users are talking about.

The topic word extraction device of the present invention can be realized by recording the processing of the topic word extraction device as a program on a computer-readable recording medium, and having each device constituting the topic word extraction device read and execute the program recorded on the recording medium. The computer system referred to here includes an OS and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) when a WWW (World Wide Web) system is used. The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in a transmission medium. Here, the "transmission medium" that transmits the program is a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line.

The program may also realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments and also includes designs and the like that do not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
10 Communication information storage server
100 Topic word extraction device
110 Communication information acquisition unit
120 Morphological analysis unit
130 Importance calculation unit
140 Unique user number acquisition unit
150 Weighting coefficient calculation unit
160 Topic word extraction unit

Claims

A topic word extraction device that extracts topic words from communication information transmitted for communication between users via the Internet, the device comprising:
communication information acquisition means for acquiring communication information for a predetermined period from one or more communication information storage servers storing the communication information;
morphological analysis means for performing morphological analysis on text information included in the acquired communication information and extracting words;
importance calculation means for calculating an importance for each word extracted by the morphological analysis means;
unique user number acquisition means for referring to the communication information acquired by the communication information acquisition means and acquiring, for each word extracted by the morphological analysis means, the number of unique users who transmitted communication information containing that word in its text information;
weighting coefficient calculation means for calculating a weighting coefficient that takes unique users into account, based on the number of unique users acquired by the unique user number acquisition means and the number of pieces of communication information acquired by the communication information acquisition means; and
topic word extraction means for extracting topic words from the words extracted by the morphological analysis means, based on the importance calculated for each word by the importance calculation means and the weighting coefficient calculated by the weighting coefficient calculation means.

The topic word extraction device according to claim 1, wherein
the communication information includes retransmission information indicating whether the communication information is a retransmission of communication information transmitted by another user,
the device further comprises retransmitted communication information number acquisition means for referring to the communication information acquired by the communication information acquisition means and acquiring, for each word extracted by the morphological analysis means and based on the retransmission information, the number of pieces of retransmitted communication information, that is, the number of pieces of communication information that retransmit communication information transmitted by another user, and
the weighting coefficient calculation means calculates a weighting coefficient that takes the degree of propagation of the communication information into account, based on the number of pieces of retransmitted communication information acquired by the retransmitted communication information number acquisition means and the number of pieces of communication information acquired by the communication information acquisition means.

The topic word extraction device according to claim 1, wherein
the communication information storage server manages link relationships between the users,
the device further comprises:
link relationship acquisition means for acquiring the link relationships between users from the communication information storage server; and
link user number acquisition means for referring to the communication information acquired by the communication information acquisition means, acquiring, for each word extracted by the morphological analysis means, the unique users who transmitted communication information containing that word, and acquiring, based on the acquired unique users and the link relationships acquired by the link relationship acquisition means, the number of link users, that is, the number of acquired unique users who are linked to other acquired unique users, and
the weighting coefficient calculation means calculates a weighting coefficient that takes the link relationships between the users into account, based on the number of link users acquired by the link user number acquisition means and the number of unique users acquired by the unique user number acquisition means.

The topic word extraction device according to claim 1, further comprising important word extraction means for extracting important words by removing, from the words extracted by the morphological analysis means, the words stored in a dictionary prepared in advance, the dictionary storing words that are used to facilitate communication between users via the Internet but are not important words characterizing the content of the text information included in the communication information, wherein
the importance calculation means calculates an importance for each important word extracted by the important word extraction means, and
the topic word extraction means extracts topic words from the important words extracted by the important word extraction means, based on the importance calculated for each important word by the importance calculation means and the weighting coefficient calculated by the weighting coefficient calculation means.

The topic word extraction device according to claim 4, wherein the dictionary includes a demonstrative pronoun dictionary that stores demonstrative pronouns, a greeting dictionary that stores words used for greetings, and a time-dependent word dictionary that stores, for each time period, words related to that time period.

The topic word extraction device according to claim 5, wherein the important word extraction means
extracts important word candidates by removing the words stored in the demonstrative pronoun dictionary and the greeting dictionary from the words extracted by the morphological analysis means,
determines, for each word of the extracted important word candidates, whether the combination of the word and the time period identified from the transmission date and time of the communication information whose text information contains the word is stored in the time-dependent word dictionary, and
extracts as important words the words that are not stored in the time-dependent word dictionary.

The topic word extraction device according to claim 5 or 6, wherein the words used for greetings include words used in small talk exchanged before getting to the main topic of a conversation or at the end of the conversation.

The topic word extraction device according to claim 5, wherein the time periods are seasons, days of the week, and times of day.

A topic word extraction method in a topic word extraction device that comprises communication information acquisition means, morphological analysis means, importance calculation means, unique user number acquisition means, weighting coefficient calculation means, and topic word extraction means, and that extracts topic words from communication information transmitted for communication between users via the Internet, the method comprising:
a first step in which the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers storing the communication information;
a second step in which the morphological analysis means performs morphological analysis on text information included in the communication information acquired in the first step and extracts words;
a third step in which the importance calculation means calculates an importance for each word extracted by the morphological analysis means;
a fourth step in which the unique user number acquisition means refers to the communication information acquired in the first step and acquires, for each word extracted in the third step, the number of unique users who transmitted communication information containing that word in its text information;
a fifth step in which the weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired in the fourth step and the number of pieces of communication information acquired in the first step; and
a sixth step in which the topic word extraction means extracts topic words from the words extracted in the third step, based on the importance calculated for each word in the third step and the weighting coefficient calculated in the fifth step.

A program for causing a computer to execute a topic word extraction method in a topic word extraction device that comprises communication information acquisition means, morphological analysis means, importance calculation means, unique user number acquisition means, weighting coefficient calculation means, and topic word extraction means, and that extracts topic words from communication information transmitted for communication between users via the Internet, the program causing the computer to execute:
a first step in which the communication information acquisition means acquires communication information for a predetermined period from one or more communication information storage servers storing the communication information;
a second step in which the morphological analysis means performs morphological analysis on text information included in the communication information acquired in the first step and extracts words;
a third step in which the importance calculation means calculates an importance for each word extracted by the morphological analysis means;
a fourth step in which the unique user number acquisition means refers to the communication information acquired in the first step and acquires, for each word extracted in the third step, the number of unique users who transmitted communication information containing that word in its text information;
a fifth step in which the weighting coefficient calculation means calculates a weighting coefficient that takes unique users into account, based on the number of unique users acquired in the fourth step and the number of pieces of communication information acquired in the first step; and
a sixth step in which the topic word extraction means extracts topic words from the words extracted in the third step, based on the importance calculated for each word in the third step and the weighting coefficient calculated in the fifth step.