JP6042789B2 - Profile word extraction device, profile word extraction method, and profile word extraction program

Info

Publication number: JP6042789B2
Application number: JP2013236960A
Authority: JP
Inventors: 伊藤　淳; 良彦数原; 浩之戸田; 鷲崎　誠司
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-11-15
Filing date: 2013-11-15
Publication date: 2016-12-14
Anticipated expiration: 2033-11-15
Also published as: JP2015097025A

Description

The present invention relates to the field of information extraction technology, and in particular to a profile word extraction device, method, and program for microblogs.

A microblog, of which Twitter (registered trademark) is a representative example, is a type of blog in which users publish information and communicate by posting short documents. A microblog user can publish a free-form self-introduction document (a profile document). Words that appear characteristically in profile documents and represent a user's characteristics well (for example, words such as "Star Wars", "Arashi", and "cat" that clearly express the user's interests) are called profile words. A profile word extraction device is a device that extracts profile words from microblog data.

Social networking services (SNS) such as Facebook (registered trademark) provide dedicated input fields for categories such as movies and music, in which users can enter specific movie titles, artist names, and so on. Microblogs have no such dedicated fields, so the way a profile text is written varies from user to user. This makes it difficult to extract profile words automatically with a rule-based approach.

In the prior art, therefore, words appearing in profile documents are clustered using a topic model such as Latent Dirichlet Allocation (see Non-Patent Document 1), and words with a high topic probability are extracted as profile words. Words with a low topic probability are general-purpose words that appear in every topic, so they are judged unsuitable as profile words and are not extracted.

Non-Patent Document 1: D. Blei, A. Ng, and M. Jordan. "Latent Dirichlet Allocation". Journal of Machine Learning Research, 3, January 2003, pp. 993-1022.

Because profile documents are written in free form, they contain, in addition to profile words, words that are undesirable as profile words because they also appear in documents other than profile documents, such as posted documents (for example, "tweet" and "follow me"). When profile words are extracted using only profile documents as the information source, as in the prior art, it is not possible to determine whether a word is undesirable in this sense. As a result, undesirable words may be erroneously extracted as profile words, lowering the extraction accuracy.

The present invention solves the above problem, and its object is to provide a profile word extraction device, method, and program capable of extracting profile words accurately.

To solve the above problem, the profile word extraction device of the present invention comprises: a document database in which a plurality of documents are stored; information source dividing means for dividing each document stored in the document database into a profile document containing profile words that represent a user's characteristics and at least one of the user's posted document and a hashtag document, and obtaining, for each user, a profile document set and at least one of a posted document set and a hashtag document set; word appearance frequency counting means for morphologically analyzing each document obtained by the information source dividing means to split it into words, counting the appearance frequency of each word, and obtaining profile document word appearance frequency information and at least one of posted document word appearance frequency information and hashtag document word appearance frequency information; and profile word extraction means for using the word appearance frequency information obtained by the word appearance frequency counting means to calculate an odds ratio between the profile document word appearance frequency and at least one of the posted document word appearance frequency and the hashtag document word appearance frequency, determining whether the odds ratio satisfies a set extraction condition, and extracting the words determined to satisfy it as profile words.

With this configuration, comparing word appearance frequencies across the multiple information sources into which the documents are divided makes it possible to identify words that appear characteristically in profile documents. This prevents the erroneous extraction of words that are unsuitable as profile words because they also appear outside profile documents, so profile words can be extracted accurately.

According to the present invention, the erroneous extraction of words that are unsuitable as profile words because they also appear outside profile documents can be prevented, and profile words can be extracted accurately.

FIG. 1 is a configuration diagram of a profile word extraction device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing the structure of the data used by each unit of the device of FIG. 1.
FIG. 3 is a flowchart of the processing executed by each unit of the device of FIG. 1.
FIG. 4 shows an example of the processing of the profile word extraction unit of the device of FIG. 1: (a) is a graph of the frequencies of words appearing in profile documents, and (b) is a graph of the frequencies of words appearing in posted documents.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiment. FIG. 1 shows the configuration of one embodiment of the present invention: a profile word extraction device that takes as input a microblog database (DB) storing a plurality of microblog documents and outputs a profile word set.

In FIG. 1, reference numeral 101 denotes a microblog DB in which microblog documents are stored.

Reference numeral 102 denotes an information source dividing unit, serving as the information source dividing means, which divides each document stored in the microblog DB 101 into a profile document containing profile words that represent a user's characteristics, the user's posted document, and a hashtag document, and obtains, for each user, a profile document set 103, a posted document set 104, and a hashtag document set 105.

Reference numeral 106 denotes a word appearance frequency counting unit, serving as the word appearance frequency counting means, which morphologically analyzes each document in the document sets 103, 104, and 105 obtained by the information source dividing unit 102 to split it into words, counts the appearance frequency of each word, and obtains profile document word appearance frequency information 107, posted document word appearance frequency information 108, and hashtag document word appearance frequency information 109.

Reference numeral 110 denotes a profile word extraction unit, serving as the profile word extraction means, which uses the word appearance frequency information 107 to 109 obtained by the word appearance frequency counting unit 106 to calculate the odds ratio between the profile document word appearance frequency and the posted document word appearance frequency and the odds ratio between the profile document word appearance frequency and the hashtag document word appearance frequency, determines whether these odds ratios satisfy a set extraction condition, and extracts the words determined to satisfy it as a profile word set 111.

The profile word extraction device of FIG. 1 is implemented, for example, on a computer and has the hardware resources of an ordinary computer, such as a ROM, RAM, CPU, input device, output device, communication interface, hard disk, and recording medium with its drive.

Through the cooperation of these hardware resources with software resources (OS, applications, and so on), the profile word extraction device implements the microblog DB 101, the information source dividing unit 102, the word appearance frequency counting unit 106, and the profile word extraction unit 110, as shown in FIG. 1.

The microblog DB 101 is built on a storage means such as a hard disk or RAM. The profile document set 103, posted document set 104, and hashtag document set 105 obtained by the information source dividing unit 102, and the profile document word appearance frequency information 107, posted document word appearance frequency information 108, and hashtag document word appearance frequency information 109 obtained by the word appearance frequency counting unit 106, are stored in a storage means such as RAM.

The profile word set 111 extracted by the profile word extraction unit 110 is stored in a storage means such as RAM or is output externally.

FIG. 2 shows the structure of the data used in this embodiment. The microblog DB 101 stores microblog documents of multiple users (FIGS. 2(a) and 2(b)), with multiple microblog documents per user. Each microblog document contains id, text, description, and hashtags. id is a unique identification number that identifies the user. text is the document posted by the user and consists of the post body combined with hashtags (each hashtag being one of the words listed in hashtags prefixed with the # character). description is the profile document. hashtags is the list of hashtags contained in text.

Because there are multiple microblog documents per user, the profile document, posted document, and hashtag document produced by the information source dividing unit 102 are obtained by concatenating, with line breaks, the profile documents, posted documents, and hashtags contained in the individual microblog documents, so that each user has one document of each kind, as shown in FIGS. 2(c), 2(d), and 2(e). When a microblog document has multiple hashtags, they are joined with half-width spaces, as in FIG. 2(e).

FIGS. 2(f), 2(g), and 2(h) show examples of the word appearance frequency information for each document set produced by the word appearance frequency counting unit 106. A word that appears multiple times is counted as 1 as long as the occurrences are in the same user's document. Alternatively, every occurrence may be counted (for example, counting "ramen" and "enthusiast" as 2 in the profile document word appearance frequency of FIG. 2(f)). The former approach is robust even when unusual users are present (for example, it limits the influence of spam-like users who write the same word many times), while the latter gives more weight to words with a high appearance frequency.
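The two counting options described above can be sketched as follows. This is a minimal illustration rather than the device's actual implementation: the function name `count_words`, the flag `per_user_binary`, and the sample tokens are hypothetical, and each user's document is assumed to have been tokenized already.

```python
from collections import Counter

def count_words(user_docs, per_user_binary=True):
    """Count word frequencies over one document set.

    user_docs: one token list per user (each user's documents already merged).
    per_user_binary=True counts a word at most once per user;
    per_user_binary=False counts every occurrence.
    """
    freq = Counter()
    for tokens in user_docs:
        freq.update(set(tokens) if per_user_binary else tokens)
    return freq

# Hypothetical tokenized profile documents for two users.
docs = [["ramen", "enthusiast", "ramen"], ["ramen", "student"]]
binary = count_words(docs)                         # once per user
raw = count_words(docs, per_user_binary=False)     # every occurrence
print(binary["ramen"], raw["ramen"])
```

With binary counting, a single user who repeats a word many times contributes only 1 to its frequency, which is the robustness property noted in the text.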

Next, the operation of the device configured as described above is explained with reference to the flowchart of FIG. 3.

First, the information source dividing unit 102 executes the processing of steps s1-1 to s1-6 below.

[Step s1-1] The processing of steps s1-2 to s1-6 is repeated for each user (id) contained in the input microblog DB 101.

[Step s1-2] The set of microblog documents created by the user with the id selected in step s1-1 is retrieved from the microblog DB 101.

[Step s1-3] The processing of steps s1-4 to s1-5 is performed for each microblog document in the set retrieved in step s1-2.

[Step s1-4] The microblog document selected in step s1-3 is processed. Three items are extracted: description as the profile document; text with the hashtags removed as the posted document; and the individual entries of hashtags joined with half-width spaces as the hashtag document. So that each user ends up with one document of each kind, the profile document, posted document, and hashtag document are appended with a line break on each iteration of step s1-3 and saved (for example, FIGS. 2(c), 2(d), and 2(e)). Because one profile document, one posted document, and one hashtag document exist per user, once all users have been processed the results are saved as the profile document set 103, the posted document set 104, and the hashtag document set 105.

[Step s1-5] The termination condition of the loop that starts at step s1-3 is checked.

[Step s1-6] The termination condition of the loop that starts at step s1-1 is checked.
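Steps s1-1 to s1-6 can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the device's actual code: records are assumed to be Python dicts with the fields id, text, description, and hashtags; the function name `split_sources` and the sample records are hypothetical; and hashtags are removed from the text by simple string replacement, with whitespace normalized afterwards.

```python
from collections import defaultdict

def split_sources(microblog_docs):
    """Split microblog records into per-user profile, post, and hashtag documents."""
    profile, posts, tags = defaultdict(list), defaultdict(list), defaultdict(list)
    for doc in microblog_docs:
        uid = doc["id"]
        body = doc["text"]
        # Remove "#tag" tokens; longer tags first so "#cats" is not cut by "#cat".
        for tag in sorted(doc["hashtags"], key=len, reverse=True):
            body = body.replace("#" + tag, "")
        profile[uid].append(doc["description"])
        posts[uid].append(" ".join(body.split()))
        tags[uid].append(" ".join(doc["hashtags"]))  # half-width space join

    def merge(parts_by_user):
        # One document per user, parts joined with line breaks.
        return {uid: "\n".join(parts) for uid, parts in parts_by_user.items()}

    return merge(profile), merge(posts), merge(tags)

# Hypothetical sample records.
sample = [
    {"id": 1, "text": "Lunch was great #ramen #tokyo",
     "description": "Ramen enthusiast", "hashtags": ["ramen", "tokyo"]},
    {"id": 1, "text": "Back to class",
     "description": "Ramen enthusiast", "hashtags": []},
    {"id": 2, "text": "#cat photos", "description": "Student", "hashtags": ["cat"]},
]
profile_set, post_set, hashtag_set = split_sources(sample)
print(post_set[1])
```

Each returned dict maps a user id to a single merged document, corresponding to the profile, posted, and hashtag document sets.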

Next, the word appearance frequency counting unit 106 executes the processing of steps s2-1 to s2-7 below.

[Step s2-1] The processing of steps s2-2 to s2-7 is repeated for each of the profile document set 103, the posted document set 104, and the hashtag document set 105.

[Step s2-2] The processing of steps s2-3 to s2-6 is repeated for each document in the target document set selected in step s2-1.

[Step s2-3] The target document is morphologically analyzed and split into words tagged with their parts of speech. An existing morphological analyzer such as MeCab (http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html) can be used.

[Step s2-4] From the part-of-speech-tagged words obtained in step s2-3, words that match a predefined stop-word list and words whose part of speech is not among the predefined target parts of speech (nouns, for example) are removed.

[Step s2-5] The appearance frequency of the words remaining after step s2-4 is counted. As noted in the description of the data structure in FIG. 2, a word that appears multiple times may be counted as 1, or every occurrence may be counted. The counts are saved as the profile document word appearance frequency information 107, the posted document word appearance frequency information 108, and the hashtag document word appearance frequency information 109, respectively.

[Step s2-6] The termination condition of the loop that starts at step s2-2 is checked.

[Step s2-7] The termination condition of the loop that starts at step s2-1 is checked.
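The loop structure of steps s2-1 to s2-7 might look like the sketch below. It is illustrative only: `toy_tokenizer` is a hypothetical stand-in for a morphological analyzer such as MeCab (it splits on whitespace and tags every token as a noun), and the stop-word list, target parts of speech, and sample documents are assumptions made for the example.

```python
from collections import Counter

STOP_WORDS = {"the", "a"}   # hypothetical stop-word list
TARGET_POS = {"noun"}       # hypothetical target parts of speech

def toy_tokenizer(text):
    """Stand-in for a morphological analyzer: returns (surface, pos) pairs.
    Every whitespace-separated token is naively tagged as a noun."""
    return [(w.lower(), "noun") for w in text.split()]

def count_document_set(documents, tokenize=toy_tokenizer):
    freq = Counter()
    for text in documents:                            # s2-2: each user's document
        tokens = tokenize(text)                       # s2-3: tokenize with POS
        kept = [w for w, pos in tokens                # s2-4: filter
                if w not in STOP_WORDS and pos in TARGET_POS]
        freq.update(set(kept))                        # s2-5: count once per user
    return freq

def count_all(document_sets):
    # s2-1: repeat for each document set (profile, post, hashtag, ...)
    return {name: count_document_set(docs) for name, docs in document_sets.items()}

counts = count_all({
    "profile": ["the student student", "a student cat"],
    "post": ["tweet tweet"],
})
print(counts["profile"])
```

Passing the tokenizer as a parameter keeps the counting logic independent of the analyzer, so a real analyzer can be substituted without changing the loop.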

Next, the profile word extraction unit 110 executes the processing of steps s3-1 to s3-7 below.

[Step s3-1] The processing of steps s3-2 to s3-6 is repeated for each word counted in the profile document set.

[Step s3-2] Let $w$ be the word selected in step s3-1. The odds ratio $OR_t$ is calculated using the profile document word appearance frequency information 107 and the posted document word appearance frequency information 108. Let $n_{p*}$ be the sum of the frequencies of all words appearing in profile documents and $n_{pw}$ the frequency of $w$ there. Likewise, let $n_{t*}$ be the sum of the frequencies of all words appearing in posted documents and $n_{tw}$ the frequency of $w$ there. $OR_t$ is then calculated by equation (1):

$$OR_t = \frac{n_{pw} / (n_{p*} - n_{pw})}{n_{tw} / (n_{t*} - n_{tw})} \qquad (1)$$

[Step s3-3] The odds ratio $OR_h$ is calculated using the profile document word appearance frequency information 107 and the hashtag document word appearance frequency information 109. Let $n_{h*}$ be the sum of the frequencies of all words appearing in hashtag documents and $n_{hw}$ the frequency of $w$ there. $OR_h$ is then calculated by equation (2):

$$OR_h = \frac{n_{pw} / (n_{p*} - n_{pw})}{n_{hw} / (n_{h*} - n_{hw})} \qquad (2)$$
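The odds ratio computation in steps s3-2 and s3-3 can be sketched as follows, using the frequencies from the FIG. 4 example (word frequency 100 of 1000 in profile documents versus 100 of 10000 in posted documents for "student", and 50 of 1000 versus 1000 of 10000 for "tweet"). The function names are hypothetical, and returning infinity when the word never occurs in the comparison source is an assumption of this sketch; the text does not say how zero counts are handled.

```python
import math

def odds(word_freq, total_freq):
    """Odds of a word within one source: freq / (total - freq)."""
    return word_freq / (total_freq - word_freq)

def odds_ratio(n_pw, n_p_total, n_xw, n_x_total):
    """Odds ratio of a word in profile documents versus a comparison source."""
    other = odds(n_xw, n_x_total)
    if other == 0:
        # Assumption: treat a word absent from the comparison source as
        # infinitely more characteristic of profile documents.
        return math.inf
    return odds(n_pw, n_p_total) / other

or_student = odds_ratio(100, 1000, 100, 10000)
or_tweet = odds_ratio(50, 1000, 1000, 10000)
print(round(or_student, 2), round(or_tweet, 2))
```

Computed without rounding the intermediate odds, the ratio for "tweet" comes out slightly below the 0.48 obtained from the rounded values 0.053 and 0.11, but it is below 1 either way.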

[Step s3-4] It is determined whether the odds ratios satisfy the predefined extraction condition. Before the predefined condition is evaluated, the odds ratios $OR_t$ and $OR_h$ are first checked to see whether both exceed a threshold. If not, the determination result is No and processing proceeds to step s3-6. If so, the predefined condition is evaluated. The threshold is normally 1, but another value may be used. The predefined condition is set to one of two options: at least one odds ratio is statistically significant, or all odds ratios are statistically significant. Whether an odds ratio is statistically significant is determined by a test using its 95% confidence interval. The confidence interval of the odds ratio $OR_t$ is obtained by equation (3):

$$\exp\!\left(\ln OR_t \pm q\sqrt{\frac{1}{n_{pw}} + \frac{1}{n_{p*} - n_{pw}} + \frac{1}{n_{tw}} + \frac{1}{n_{t*} - n_{tw}}}\right) \qquad (3)$$

Similarly, the confidence interval of the odds ratio $OR_h$ is obtained by equation (4):

$$\exp\!\left(\ln OR_h \pm q\sqrt{\frac{1}{n_{pw}} + \frac{1}{n_{p*} - n_{pw}} + \frac{1}{n_{hw}} + \frac{1}{n_{h*} - n_{hw}}}\right) \qquad (4)$$

In equations (3) and (4), q is 1.96 for a 95% confidence interval (the value read from a normal distribution table). When the test conditions are changed, for example to a 99% confidence interval, q changes accordingly; for a 99% confidence interval it is 2.58. The test conditions are assumed to be set in advance. When the confidence interval does not contain 1 (that is, it does not straddle 1), the odds ratio is judged significant. If the predefined condition requires that at least one odds ratio be statistically significant, the result is Yes and processing proceeds to step s3-5 when either odds ratio is significant; otherwise the result is No and processing proceeds to step s3-6. If the predefined condition requires that all odds ratios be statistically significant, the result is Yes and processing proceeds to step s3-5 when both are significant; otherwise the result is No and processing proceeds to step s3-6.
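The branching in step s3-4 can be sketched as a small decision function. This is a hedged illustration: the function names, the `mode` parameter ("any" or "all"), and the numeric inputs are hypothetical, and the confidence intervals are assumed to have been computed beforehand.

```python
def is_significant(interval):
    """An odds ratio is judged significant when its confidence interval
    does not contain 1."""
    low, high = interval
    return not (low <= 1.0 <= high)

def meets_extraction_condition(odds_ratios, intervals, mode="any", threshold=1.0):
    """Return True (Yes, go to s3-5) or False (No, go to s3-6).

    odds_ratios: e.g. [OR_t, OR_h]
    intervals:   matching (low, high) confidence intervals
    mode:        "any" = at least one significant, "all" = all significant
    """
    # Preliminary check: every odds ratio must exceed the threshold.
    if not all(r > threshold for r in odds_ratios):
        return False
    flags = [is_significant(ci) for ci in intervals]
    return any(flags) if mode == "any" else all(flags)

# Hypothetical values: the first ratio is significant, the second is not.
ratios = [11.0, 2.0]
cis = [(8.3, 14.6), (0.9, 4.4)]
print(meets_extraction_condition(ratios, cis, mode="any"))
print(meets_extraction_condition(ratios, cis, mode="all"))
```

Keeping the threshold check separate from the significance test mirrors the order described in the text: a word whose odds ratios do not all exceed the threshold is rejected before any interval is examined.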

FIG. 4 shows an example of the processing in step s3-4. In this example, the word "student" is extracted as a profile word under the predefined condition that at least one odds ratio be statistically significant.

In FIG. 4(a), the odds that the word "student" appears in profile documents A are Odds(A, student) = a/b = 100/(1000 - 100) = 0.11, and the odds that the word "tweet" appears in profile documents A are Odds(A, tweet) = a/b = 50/(1000 - 50) = 0.053 (a is the appearance frequency in profile documents of the word selected for processing, and b is the sum of the frequencies of all words appearing in profile documents minus a).

In FIG. 4(b), the odds that the word "student" appears in posted documents B are Odds(B, student) = c/d = 100/(10000 - 100) = 0.01, and the odds that the word "tweet" appears in posted documents B are Odds(B, tweet) = c/d = 1000/(10000 - 1000) = 0.11 (c is the appearance frequency in posted documents of the word selected for processing, and d is the sum of the frequencies of all words appearing in posted documents minus c).

Therefore, when the target word is "student", the odds ratio between its profile document and posted document appearance frequencies is Odds(A, student)/Odds(B, student) = 0.11/0.01 = 11 > 1. Because the ratio exceeds 1, the word may be characteristic of profile documents.

By contrast, when the target word is "tweet", the odds ratio is Odds(A, tweet)/Odds(B, tweet) = 0.053/0.11 = 0.48 < 1. Because the ratio is below 1, the word is not characteristic of profile documents.

Then, for the possibly characteristic word "student", the odds ratio is tested with a 95% confidence interval using equation (5), which has the same form as equation (3) (q = 1.96, OR = 11, a = 100, b = 900, c = 100, d = 9900):

$$\exp\!\left(\ln OR \pm q\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right) \qquad (5)$$

The 95% confidence interval of the odds ratio OR obtained from equation (5) is [8.29, 14.63]. Because the interval does not contain 1, the odds ratio is judged significant (the extraction condition is satisfied).
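The confidence interval test for "student" can be checked numerically with the sketch below. It assumes the standard log-based interval for an odds ratio, exp(ln OR ± q·sqrt(1/a + 1/b + 1/c + 1/d)); this form is an assumption of the sketch, and the function name `odds_ratio_ci` is hypothetical. The q values are the ones given in the text.

```python
import math

Q_VALUES = {95: 1.96, 99: 2.58}  # q for each confidence level, as given in the text

def odds_ratio_ci(a, b, c, d, level=95):
    """Odds ratio (a/b)/(c/d) and its log-based confidence interval.

    a: word frequency in profile documents, b: all other profile word frequency
    c: word frequency in the comparison source, d: all other frequency there
    """
    ratio = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    q = Q_VALUES[level]
    log_or = math.log(ratio)
    return ratio, (math.exp(log_or - q * se), math.exp(log_or + q * se))

ratio, (low, high) = odds_ratio_ci(100, 900, 100, 9900)
significant = not (low <= 1.0 <= high)
print(f"OR={ratio:.2f}, 95% CI=[{low:.2f}, {high:.2f}], significant={significant}")
```

Under this assumed formula the interval is approximately [8.27, 14.63]. The upper bound agrees with the value stated above, while the lower bound is slightly below the stated 8.29, possibly because of rounding in the original calculation. In either case the interval lies well above 1, so the conclusion that the odds ratio is significant is unchanged.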

[Step s3-5] The word w is extracted as a profile word.

[Step s3-6] The termination condition of the loop that starts at step s3-1 is checked.

[Step s3-7] All extracted profile words are output as the profile word set 111.

As described above, this embodiment uses information sources other than profile documents (posted documents and hashtag documents) and compares the word appearance frequency information of each source to judge whether a word appears characteristically in profile documents. This prevents the erroneous extraction of words that are unsuitable as profile words because they also appear outside profile documents, so profile words can be extracted accurately.

Although this embodiment uses three information sources, two information sources may be used instead, namely profile documents together with either posted documents or hashtag documents. Three or more may also be used. Furthermore, additional information sources need not come from microblogs; documents obtained from other media such as newspapers and Web pages may be used. All that is required is that word appearance frequencies or co-occurrence information can be compared between profile documents and the other documents.

Some or all of the functions of each means in the profile word extraction device of this embodiment can of course be implemented as a computer program and executed on a computer to realize the present invention, and the procedures of the profile word extraction method of this embodiment can likewise be implemented as a computer program and executed by a computer. A program that realizes these functions on a computer can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, for example via the Internet or e-mail.

DESCRIPTION OF SYMBOLS
101 … Microblog DB
102 … Information source dividing unit
103 … Profile document set
104 … Posted document set
105 … Hashtag document set
106 … Word appearance frequency counting unit
107 … Profile document word appearance frequency information
108 … Posted document word appearance frequency information
109 … Hashtag document word appearance frequency information
110 … Profile word extraction unit
111 … Profile word set

Claims

A profile word extraction device comprising:
a document database in which a plurality of documents are stored;
information source dividing means for dividing each document stored in the document database into a profile document containing profile words that represent a user's characteristics and at least one of the user's posted document and a hashtag document, and obtaining, for each user, a profile document set and at least one of a posted document set and a hashtag document set;
word appearance frequency counting means for morphologically analyzing each document obtained by the information source dividing means to split it into words, counting the appearance frequency of each word, and obtaining profile document word appearance frequency information and at least one of posted document word appearance frequency information and hashtag document word appearance frequency information; and
profile word extraction means for using the word appearance frequency information obtained by the word appearance frequency counting means to calculate an odds ratio between the profile document word appearance frequency and at least one of the posted document word appearance frequency and the hashtag document word appearance frequency, determining whether the odds ratio satisfies a set extraction condition, and extracting the words determined to satisfy it as profile words.

The profile word extraction device according to claim 1, wherein the profile word extraction means determines whether the extraction condition is satisfied by testing, with a confidence interval based on preset test conditions, whether the calculated odds ratio is statistically significant.

A profile word extraction method comprising:
an information source dividing step in which information source dividing means divides a plurality of documents stored in a document database into a profile document containing profile words that represent a user's characteristics and at least one of the user's posted document and a hashtag document, and obtains, for each user, a profile document set and at least one of a posted document set and a hashtag document set;
a word appearance frequency counting step in which word appearance frequency counting means morphologically analyzes each document obtained by the information source dividing means to split it into words, counts the appearance frequency of each word, and obtains profile document word appearance frequency information and at least one of posted document word appearance frequency information and hashtag document word appearance frequency information; and
a profile word extraction step in which profile word extraction means uses the word appearance frequency information obtained by the word appearance frequency counting means to calculate an odds ratio between the profile document word appearance frequency and at least one of the posted document word appearance frequency and the hashtag document word appearance frequency, determines whether the odds ratio satisfies a set extraction condition, and extracts the words determined to satisfy it as profile words.

The profile word extraction method according to claim 3, wherein, in the profile word extraction step, whether the extraction condition is satisfied is determined by testing, with a confidence interval based on preset test conditions, whether the calculated odds ratio is statistically significant.

A profile word extraction program that causes a computer to function as each means of the profile word extraction device according to claim 1 or 2.