JP2013257747A

JP2013257747A - Free time estimation device, method and program

Info

Publication number: JP2013257747A
Application number: JP2012133562A
Authority: JP
Inventors: Mariko Kawaba; 真理子川場; Toru Hirano; 徹平野; Toshiaki Makino; 俊朗牧野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-06-13
Filing date: 2012-06-13
Publication date: 2013-12-26

Abstract

PROBLEM TO BE SOLVED: To appropriately estimate a user's free time from documents posted on the Internet. SOLUTION: A preprocessing unit performs morphological analysis on each blog document in an input blog document collection and extracts the posting date and time attached to each blog document. A word n-gram extraction unit 141 extracts word n-grams from the blog document collection using the morphological analysis result. An average value calculation unit 142 uses the posting date and time data to obtain the average number of tweets for each time slot of the day. A per-time-slot count extraction unit 143 extracts the average number of tweets for each time slot as a feature. A standard deviation calculation unit 144 calculates a standard deviation indicating the variation in the user's number of tweets over a day. The user's free time is estimated using the features output by a feature extraction unit 14, which combine the word n-grams, the per-time-slot tweet counts, and the standard deviation of the tweet counts, together with a free time estimation model generated in advance from a blog document collection for learning.

Description

本発明は自由時間推定装置、方法、及びプログラムに係り、特に、インターネット上に投稿されたブログ等の文書に基づいて、ユーザの自由時間を推定する自由時間推定装置、方法、及びプログラムに関する。 The present invention relates to a free time estimation apparatus, method, and program, and more particularly, to a free time estimation apparatus, method, and program for estimating a user's free time based on a document such as a blog posted on the Internet.

Currently, attributes of a user (gender, place of residence, etc.) are determined from documents (text data) that the user has posted on the Internet, for example on a microblog such as Twitter (registered trademark) (see, for example, Non-Patent Document 1). For example, a model is trained in advance using, as features, n-grams of words related to each attribute, and this model is used to estimate the attributes of a user whose attributes are unknown from the words contained in documents posted by that user. With such a method, attributes can be determined accurately when the correlation between the written content and the attribute is strong.

Non-Patent Document 1: Tsutomu Okura, Nobuyuki Shimizu, Hiroshi Nakagawa, “Scalable and Versatile Blog Author Attribute Determination Method”, Information Processing Society of Japan Research Report, 2007-NL-181(1)

しかしながら、性別や居住地のような属性とは異なり、ユーザの自由時間については、ユーザが投稿した文書の内容との間に強い相関が存在しない場合が多い。ここで、自由時間とは、一定期間（例えば、１日、１週間、１ヶ月等）当たりの仕事、家事、睡眠、食事等以外の余暇の時間を示すユーザの属性である。例えば、政治の話題を頻繁にマイクロブログで投稿するユーザが複数存在する場合に、これら複数のユーザの各々の自由時間が同じであるとは考え難い。このように、投稿された文書の内容のみに基づいて、ユーザの自由時間を推定することは困難である。 However, unlike attributes such as gender and place of residence, there is often no strong correlation between the user's free time and the content of the document posted by the user. Here, the free time is an attribute of the user indicating leisure time other than work, housework, sleep, meal, etc. per certain period (for example, one day, one week, one month, etc.). For example, when there are a plurality of users who frequently post political topics on microblogs, it is difficult to think that the free times of these users are the same. Thus, it is difficult to estimate the user's free time based only on the content of the posted document.

The present invention has been made in view of the above facts, and an object thereof is to provide a free time estimation device, method, and program capable of appropriately estimating a user's free time from documents posted on the Internet.

In order to achieve the above object, the free time estimation device of the present invention includes: feature extraction means for extracting, based on a document set containing a plurality of document data posted by the same user, each with data indicating its posting date and time attached, a first feature based on the appearance frequency of words contained in the document set and a second feature indicating the distribution of the number of posts of document data for each of the divided periods obtained by dividing a certain period into a plurality of periods; and estimation means for estimating the free time in the certain period of an estimation target user whose free time in the certain period is unknown, based on an estimation model that has learned the association between the first and second features extracted from each of a plurality of learning document sets, each assigned a correct label of a user's free time in the certain period, and the correct label assigned to each learning document set, and on the first and second features extracted by the feature extraction means from the document set of the estimation target user.

According to the free time estimation device of the present invention, the feature extraction means extracts, based on a document set containing a plurality of document data posted by the same user, each with data indicating its posting date and time attached, a first feature based on the appearance frequency of words contained in the document set and a second feature indicating the distribution of the number of posts of document data for each of the divided periods obtained by dividing a certain period into a plurality of periods. The estimation means then estimates the free time in the certain period of an estimation target user whose free time in the certain period is unknown, based on an estimation model that has learned the association between the first and second features extracted from each of a plurality of learning document sets, each assigned a correct label of a user's free time in the certain period, and the correct label assigned to each learning document set, and on the first and second features extracted by the feature extraction means from the document set of the estimation target user.

In this way, by using, in addition to the first feature based on the appearance frequency of words, the second feature indicating the distribution of the number of posts of document data over the divided periods in a certain period, the user's free time can be appropriately estimated from documents posted on the Internet.

また、前記素性抽出手段は、前記第２素性として、前記分割期間毎の投稿数の平均、及び前記一定期間における前記分割期間毎の投稿数のばらつきを抽出することができる。このような第２素性を抽出することにより、ユーザの自由時間の特徴を捉えることが可能になる。 Further, the feature extraction means can extract the average of the number of posts for each of the divided periods and the variation in the number of posts for each of the divided periods in the certain period as the second feature. By extracting such second features, it is possible to capture the features of the user's free time.

また、本発明の自由時間推定装置は、前記複数の学習用文書集合を用いて、前記推定モデルを学習する学習手段を含んで構成することができる。これにより学習機能を併せ持つことができる。 In addition, the free time estimation apparatus of the present invention can be configured to include learning means for learning the estimation model using the plurality of learning document sets. Thereby, it can have a learning function.

In the free time estimation method of the present invention, feature extraction means extracts, based on a document set containing a plurality of document data posted by the same user, each with data indicating its posting date and time attached, a first feature based on the appearance frequency of words contained in the document set and a second feature indicating the distribution of the number of posts of document data for each of the divided periods obtained by dividing a certain period into a plurality of periods; and estimation means estimates the free time in the certain period of an estimation target user whose free time in the certain period is unknown, based on an estimation model that has learned the association between the first and second features extracted from each of a plurality of learning document sets, each assigned a correct label of a user's free time in the certain period, and the correct label assigned to each learning document set, and on the first and second features extracted by the feature extraction means from the document set of the estimation target user.

また、本発明の自由時間推定プログラムは、コンピュータを、上記の自由時間推定装置を構成する各手段として機能させるためのプログラムである。 The free time estimation program of the present invention is a program for causing a computer to function as each means constituting the above free time estimation device.

本発明の自由時間推定装置、方法、及びプログラムによれば、単語の出現頻度に基づく第１素性に加え、一定期間における分割期間毎の文書データの投稿数の分布を示す第２素性を用いることにより、インターネット上に投稿された文書からユーザの自由時間を適切に推定することができる、という効果を有する。 According to the free time estimation device, method, and program of the present invention, in addition to the first feature based on the appearance frequency of words, the second feature indicating the distribution of the number of posted document data for each divided period in a certain period is used. Thus, it is possible to appropriately estimate the user's free time from a document posted on the Internet.

FIG. 1 is a block diagram showing the functional configuration of the free time estimation device according to the present embodiment.
FIG. 2 is a block diagram showing the functional configuration of the feature extraction unit.
FIG. 3 is a diagram showing the distribution of the number of tweets for each user.
FIG. 4 is a flowchart showing the learning processing routine in the present embodiment.
FIG. 5 is a flowchart showing the estimation processing routine in the present embodiment.

以下、図面を参照して本発明の実施の形態を詳細に説明する。なお、本実施の形態では、ユーザの１日当たりの自由時間を推定する場合を例に説明する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In the present embodiment, a case where the user's free time per day is estimated will be described as an example.

The free time estimation device 10 according to the present embodiment is configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a free time estimation processing routine including the learning process and the estimation process described later. An HDD may also be provided as storage means.

Functionally, as shown in FIG. 1, this computer can be represented as a configuration including a learning unit 11 and an estimation unit 12. The learning unit 11 can in turn be represented as including a preprocessing unit 13, a feature extraction unit 14, and an estimation model learning unit 15, and the estimation unit 12 as including the preprocessing unit 13, the feature extraction unit 14, and a free time estimation unit 16. The preprocessing unit 13 and the feature extraction unit 14 are functional units shared by the learning unit 11 and the estimation unit 12.

The free time estimation device 10 accepts, as input documents (text data), blog documents posted on a microblog. Hereinafter, the act of posting a blog document on a microblog, or the posted document itself, is also referred to as a “tweet”. A tweet consists of a small amount of written text data, up to about 140 characters, and its content generally expresses the user's current situation, location, thoughts, and so on. Each blog document handled in the present embodiment has attached to it data on the date and time at which the user posted it (the date and time at which the user tweeted).

また、自由時間推定装置１０への入力文書の入力は、あるユーザが所定期間（例えば、１年間）に投稿した複数のブログ文書からなるブログ文書集合単位で行われる。学習部１１に入力されるブログ文書集合は、複数のユーザ毎のブログ文書集合であり、各ブログ文書集合には、ユーザの自由時間を示す正解ラベルが予め人手で付与されている。推定部１２に入力されるブログ文書集合は、自由時間を推定したい推定対象ユーザにより所定期間に投稿されたブログ文書集合である。 The input document input to the free time estimation device 10 is performed in units of a blog document set including a plurality of blog documents posted by a certain user during a predetermined period (for example, one year). The set of blog documents input to the learning unit 11 is a set of blog documents for each of a plurality of users, and each blog document set is manually assigned a correct label indicating the user's free time. The set of blog documents input to the estimation unit 12 is a set of blog documents posted in a predetermined period by the estimation target user who wants to estimate free time.

以下、自由時間推定装置１０の各部につて詳述する。 Hereinafter, each part of the free time estimation apparatus 10 will be described in detail.

The preprocessing unit 13 segments each blog document in the input blog document set into words by morphological analysis, an existing technique, and outputs a morphological analysis result in which part-of-speech information is attached to each word. For example, when the blog document (text data) 「横浜に着いた。」 (“I arrived in Yokohama.”) is input, the morphological analysis result “横浜 (noun) / に (case particle: continuative) / つ (verb stem: K) / い (verb inflectional ending) / た (verb suffix: terminal) / 。 (period)” is output.

また、前処理部１３は、各ブログ文書に付加された投稿日時のデータを抽出して出力する。 In addition, the preprocessing unit 13 extracts and outputs the posting date data added to each blog document.

素性抽出部１４は、図２に示すように、単語ｎ−ｇｒａｍ抽出部１４１と、平均値算出部１４２と、時間帯毎回数抽出部１４３と、標準偏差算出部１４４とを含んだ構成で表すことができる。 As shown in FIG. 2, the feature extraction unit 14 is represented by a configuration including a word n-gram extraction unit 141, an average value calculation unit 142, a time-period number extraction unit 143, and a standard deviation calculation unit 144. be able to.

単語ｎ−ｇｒａｍ抽出部１４１は、前処理部１３から出力された形態素解析結果を利用して、入力されたブログ文書集合から、素性として単語ｎ−ｇｒａｍを抽出する。単語ｎ−ｇｒａｍは形態素の表記とその表記のブログ文書集合内における出現頻度とで表される素性である。この素性は、ユーザが投稿したブログ文書の内容に由来するものであり、ユーザの自由時間と関連する生活スタイルや職業等が反映された情報となる。 The word n-gram extraction unit 141 extracts the word n-gram as a feature from the input blog document set using the morphological analysis result output from the preprocessing unit 13. The word n-gram is a feature represented by a morpheme notation and an appearance frequency of the notation in the blog document set. This feature is derived from the content of the blog document posted by the user, and is information that reflects the lifestyle and occupation related to the user's free time.

For example, if a blog document included in the blog document set is
横浜に着いた、横浜はいい天気。 (“I arrived in Yokohama; the weather in Yokohama is nice.”)
the following word n-grams (here, n = 1) are extracted from its morphological analysis result.
横浜 (Yokohama): 2, に (particle): 1, 着く (arrive): 1, は (particle): 1, いい (nice): 1, 天気 (weather): 1
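The unigram counts in this example can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the morphological analysis has already produced a list of base-form morphemes with punctuation removed, and the function names `word_ngrams` and `ngram_features` are hypothetical, not taken from the patent.

```python
from collections import Counter

def word_ngrams(morphemes, n=1):
    """Count word n-grams over one already-segmented document."""
    grams = (tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1))
    return Counter(grams)

def ngram_features(documents, n=1):
    """Sum n-gram counts over every document in one user's document set."""
    total = Counter()
    for morphemes in documents:
        total.update(word_ngrams(morphemes, n))
    return total

# Assumed segmentation of the example post (base forms, punctuation dropped).
doc = ["横浜", "に", "着く", "横浜", "は", "いい", "天気"]
unigrams = ngram_features([doc], n=1)
bigrams = ngram_features([doc], n=2)
```

Here `unigrams[("横浜",)]` is 2 and every other unigram has count 1, matching the example. Keys are tuples so that the same code handles n greater than 1.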

なお、ブログ文書に含まれる単語のうち、語彙的意味を表す内容語のみを対象として単語ｎ−ｇｒａｍを抽出してもよい。 Note that the word n-gram may be extracted from only the content words representing the lexical meaning among the words included in the blog document.

平均値算出部１４２は、前処理部１３から出力された投稿日時のデータを利用して、一定期間（ここでは１日）における時間帯毎のつぶやき回数の平均値を求める。例えば、あるユーザの午前０時〜１時のつぶやき回数が、３月２１日は３回、３月２２日は５回、３月２３日は４回だとすると、このユーザの午前０時〜１時のつぶやき回数の平均値は４となる。 The average value calculation unit 142 uses the posting date data output from the preprocessing unit 13 to determine the average value of the number of tweets for each time period in a certain period (here 1 day). For example, if a user's tweet count from midnight to 1 am is 3 times on March 21, 5 times on March 22, and 4 times on March 23, this user's midnight to 1 am The average value of the number of tweets is 4.
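The averaging step can be sketched as follows, assuming one-hour time slots and posting timestamps available as `datetime` objects. The function name `mean_posts_per_hour` is hypothetical, and dividing by the number of distinct days on which the user posted is an assumption; the source does not say how days without any posts are treated.

```python
from collections import Counter
from datetime import datetime

def mean_posts_per_hour(timestamps):
    """Return a 24-element list: average number of posts in each hour-of-day slot.

    Assumption: the average is taken over the distinct calendar days on
    which the user posted at least once.
    """
    days = {ts.date() for ts in timestamps}
    if not days:
        return [0.0] * 24
    per_hour = Counter(ts.hour for ts in timestamps)
    return [per_hour[h] / len(days) for h in range(24)]

# The worked example: 3, 5 and 4 posts between 0:00 and 1:00 on March 21 to 23.
posts = (
    [datetime(2012, 3, 21, 0, m) for m in (5, 20, 40)]
    + [datetime(2012, 3, 22, 0, m) for m in (1, 10, 25, 30, 55)]
    + [datetime(2012, 3, 23, 0, m) for m in (2, 15, 33, 48)]
)
means = mean_posts_per_hour(posts)
```

With these twelve posts over three days, `means[0]` is 4.0, as in the example, and all other slots are 0.0.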

時間帯毎回数抽出部１４３は、平均値算出部１４２で算出された時間帯毎のつぶやき回数の平均値を素性として抽出する。例えば、あるユーザの午前０時〜１時のつぶやき回数の平均値が４回だとすると、このユーザの午前０時〜１時のつぶやき回数を表す素性は４となる。 The number-of-times extraction unit 143 extracts the average value of the number of tweets for each time period calculated by the average value calculation unit 142 as a feature. For example, if the average value of the number of tweets from midnight to 1 am for a certain user is 4, the feature representing the number of tweets from 0:00 to 1 am for this user is 4.

標準偏差算出部１４４は、平均値算出部１４２で算出された時間帯毎のつぶやき回数の平均値に基づいて、ユーザの一日のつぶやき回数のばらつきを示す標準偏差を算出し、素性として出力する。 Based on the average value of the number of tweets for each time period calculated by the average value calculation unit 142, the standard deviation calculation unit 144 calculates a standard deviation indicating the variation in the number of tweets per day of the user and outputs it as a feature. .
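As a rough illustration, the spread of the per-slot averages can be computed with the standard library. The sketch below assumes 24 hourly slots and uses the population standard deviation; the source does not specify population versus sample, and the name `post_count_spread` and the two example profiles are made up for illustration.

```python
from statistics import pstdev

def post_count_spread(hourly_means):
    """Population standard deviation of the per-slot average post counts."""
    return pstdev(hourly_means)

# Two hypothetical users with the same daily total (24 posts on average).
steady = [1.0] * 24                      # posts evenly around the clock
bursty = [0.0] * 21 + [8.0, 8.0, 8.0]    # posts only in three evening slots

steady_spread = post_count_spread(steady)
bursty_spread = post_count_spread(bursty)
```

The steady profile gives a spread of 0, while the bursty one gives the square root of 7 (about 2.65), so the value separates users who post evenly from users who post only in a few slots.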

Here, as shown in FIG. 3, the distribution of the number of tweets over the time slots differs from user to user. The distribution of the number of tweets is also considered to correlate with the user's free time. For example, a user who tweets at a steady frequency in every time slot is considered to have a lot of free time, whereas a user who tweets only in specific time slots is presumably occupied with work, housework, and the like at other times and is considered to have little free time. Thus, by using the distribution of the number of tweets, such as the number of tweets for each time slot and the standard deviation of the number of tweets, as features, these characteristics of the user's free time can be captured.

The feature extraction unit 14 combines the word n-grams, the number of tweets for each time slot, and the standard deviation of the number of tweets described above, and outputs them as the features extracted from one blog document set. As described above, the word n-grams reflect the lifestyle, occupation, and so on related to the user's free time, and the distribution of the number of tweets captures the characteristics of the user's free time. Using them together as features therefore allows the free time estimation model 20 to be learned, and the free time to be estimated, with high accuracy.
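One way to combine the three kinds of features is a single name-to-value mapping per document set. The sketch below is an assumption about representation only: the key scheme (`ngram=`, `hour=`, `hour_stddev`) and the function name `combine_features` are hypothetical, and the input values are placeholders.

```python
def combine_features(ngram_counts, hourly_means, spread):
    """Merge word n-gram counts, per-slot averages and their spread into one dict."""
    features = {}
    for gram, count in ngram_counts.items():
        features["ngram=" + " ".join(gram)] = float(count)
    for hour, mean in enumerate(hourly_means):
        features[f"hour={hour:02d}"] = float(mean)
    features["hour_stddev"] = float(spread)
    return features

# Placeholder inputs for one user's document set.
ngram_counts = {("横浜",): 2, ("天気",): 1}
hourly_means = [4.0] + [0.0] * 23
features = combine_features(ngram_counts, hourly_means, spread=0.8)
```

Prefixing the keys keeps the word features and the posting-time features from colliding, and a sparse mapping avoids fixing the vocabulary in advance.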

When functioning as the feature extraction unit 14 of the learning unit 11, it pairs the features extracted for each blog document set with the correct label assigned to that blog document set and passes the pair to the estimation model learning unit 15 in the subsequent stage. When functioning as the feature extraction unit 14 of the estimation unit 12, it passes the extracted features to the free time estimation unit 16 in the subsequent stage.

The estimation model learning unit 15 uses an existing technique to learn the association within each pair of features and correct label output from the feature extraction unit 14, and generates a free time estimation model 20 for estimating a user's free time. For example, the free time estimation model 20 can be generated by regression analysis.
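The source names regression analysis only as one example and does not specify the model. As one possible instance, the sketch below fits an ordinary least-squares linear regression (with an optional ridge term) by solving the normal equations in pure Python. The two input columns (tweet-count standard deviation and average evening tweet count), the label values, and the function names are all made up for illustration.

```python
def fit_linear_regression(rows, targets, ridge=0.0):
    """Fit y ~ w0 + w1*x1 + ... by solving the (optionally ridge-regularised) normal equations."""
    xs = [[1.0] + list(r) for r in rows]  # prepend the bias term
    d = len(xs[0])
    # Augmented matrix [X^T X + ridge*I | X^T y]; the bias is not regularised.
    a = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j and i > 0 else 0.0)
          for j in range(d)] + [sum(x[i] * y for x, y in zip(xs, targets))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; add data or set ridge > 0")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(d):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][d] / a[i][i] for i in range(d)]

def predict(weights, row):
    """Apply the fitted weights (bias first) to one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

# Illustrative data only: (tweet-count stddev, mean evening tweets) -> free hours per day.
rows = [(0.5, 1.0), (1.0, 0.5), (2.0, 3.0), (3.0, 1.0)]
labels = [7.75, 6.75, 6.5, 4.0]
weights = fit_linear_regression(rows, labels)
```

The toy labels were generated from 8 - 1.5 x1 + 0.5 x2, so the fit recovers those coefficients. With real n-gram features the number of columns would far exceed the number of users, which is why the sketch exposes a `ridge` parameter.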

自由時間推定部１６は、推定モデル学習部１５により生成された自由時間推定モデル２０と、推定対象ユーザのブログ文書集合から抽出された素性とを用いて、ユーザの自由時間を推定して出力する。 The free time estimation unit 16 estimates and outputs the user's free time using the free time estimation model 20 generated by the estimation model learning unit 15 and the features extracted from the blog document set of the estimation target user. .

Next, the operation of the free time estimation device 10 according to the present embodiment will be described. In the learning stage, when blog document sets for a plurality of users are input to the free time estimation device 10, the learning unit 11 executes the learning processing routine shown in FIG. 4. In the estimation stage, when the blog document set of an estimation target user whose free time is to be estimated is input to the free time estimation device 10, the estimation unit 12 executes the estimation processing routine shown in FIG. 5. Each process is described in detail below.

First, in the learning processing routine, in step 100, the preprocessing unit 13 acquires the input blog document sets of the plurality of users. Next, in step 102, the preprocessing unit 13 selects the blog document set of one user from the plurality of blog document sets. Next, in step 104, the preprocessing unit 13 segments each blog document in the selected set into words by morphological analysis, an existing technique, and outputs a morphological analysis result in which part-of-speech information is attached to each word. The preprocessing unit 13 also extracts and outputs the posting date and time data attached to each blog document.

次に、ステップ１０６で、単語ｎ−ｇｒａｍ抽出部１４１が、上記ステップ１０４で出力された形態素解析結果を利用して、上記ステップ１０２で選択されたブログ文書集合から単語ｎ−ｇｒａｍを抽出する。 Next, in step 106, the word n-gram extraction unit 141 extracts the word n-gram from the blog document set selected in step 102 using the morpheme analysis result output in step 104.

Next, in step 108, the average value calculation unit 142 uses the posting date and time data output in step 104 to obtain the average number of tweets for each time slot in the certain period. Next, in step 110, the per-time-slot count extraction unit 143 extracts, as a feature, the average number of tweets for each time slot calculated in step 108. Next, in step 112, the standard deviation calculation unit 144 calculates, based on the average number of tweets for each time slot calculated in step 108, a standard deviation indicating the variation in the user's number of tweets over a day, and outputs it as a feature.

次に、ステップ１１４で、素性抽出部１４が、上記ステップ１０６で抽出された単語ｎ−ｇｒａｍ、上記ステップ１１０で抽出された時間帯毎のつぶやき回数、及び上記ステップ１１２で算出されたつぶやき回数の標準偏差をまとめて、上記ステップ１０２で選択されたブログ文書集合の素性とし、そのブログ文書集合に付与された自由時間の正解ラベルとのペアを作成する。 Next, in step 114, the feature extraction unit 14 determines the word n-gram extracted in step 106, the number of tweets for each time period extracted in step 110, and the number of tweets calculated in step 112. The standard deviations are collected and used as the features of the blog document set selected in step 102, and a pair with the correct answer label for the free time given to the blog document set is created.

Next, in step 116, the learning unit 11 determines whether the process of creating a pair of correct label and features has been completed for all users whose blog document sets were input. If an unprocessed user remains, the routine returns to step 102, selects the next user's blog document set, and repeats steps 104 to 114. When all users have been processed, the routine proceeds to step 118, where the estimation model learning unit 15 learns from the plurality of pairs of features and correct labels created in step 114 and generates the free time estimation model 20. The generated free time estimation model 20 is stored in a predetermined storage area, and the learning processing routine ends.
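The control flow of the learning routine can be outlined as below. This is a skeleton under stated assumptions: feature extraction and model training are passed in as callables, and the stand-ins used in the demonstration (a post-count feature and a mean-label "model") are placeholders, not the patent's features or learner.

```python
def build_training_pairs(document_sets, labels, extract_features):
    """Steps 102 to 116: create one (features, correct label) pair per user."""
    pairs = []
    for user_id, documents in document_sets.items():  # select the next user's set
        features = extract_features(documents)        # steps 104 to 112
        pairs.append((features, labels[user_id]))     # step 114
    return pairs

def run_learning(document_sets, labels, extract_features, train):
    """Step 118: learn a model from all pairs once every user is processed."""
    return train(build_training_pairs(document_sets, labels, extract_features))

# Placeholder stand-ins, for illustration only.
document_sets = {"user_a": ["post 1", "post 2"], "user_b": ["post 1"]}
labels = {"user_a": 5.0, "user_b": 2.0}

def count_posts(documents):
    return {"post_count": float(len(documents))}

def mean_label(pairs):
    return sum(label for _, label in pairs) / len(pairs)

pairs = build_training_pairs(document_sets, labels, count_posts)
model = run_learning(document_sets, labels, count_posts, mean_label)
```

Passing the extractor and trainer as arguments mirrors the way the preprocessing and feature extraction units are shared between the learning and estimation sides.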

次に、推定処理ルーチンでは、ステップ１２０で、前処理部１３が、入力された推定対象ユーザのブログ文書集合を取得する。次に、ステップ１２２〜１３０で、前処理部１３及び素性抽出部１４が、学習処理のステップ１０４〜１１２と同様の処理により、入力されたブログ文書集合の素性を抽出する。 Next, in the estimation processing routine, in step 120, the preprocessing unit 13 acquires the input blog document set of the estimation target user. Next, in steps 122 to 130, the preprocessing unit 13 and the feature extraction unit 14 extract the features of the input blog document set by the same processing as the learning processing steps 104 to 112.

次に、ステップ１３２で、自由時間推定部１６が、学習処理で生成された自由時間推定モデル２０と、推定対象ユーザのブログ文書集合から抽出された素性とを用いて、ユーザの自由時間を推定し、推定結果を出力して、推定処理ルーチンを終了する。 Next, in step 132, the free time estimation unit 16 estimates the user's free time using the free time estimation model 20 generated by the learning process and the features extracted from the blog document set of the estimation target user. Then, the estimation result is output and the estimation processing routine is terminated.
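If the free time estimation model 20 happens to be a linear model over named features, step 132 reduces to a weighted sum. The sketch below makes that assumption explicit; the model layout (`bias`, `weights`), the feature names, and all numeric values are hypothetical.

```python
def estimate_free_time(model, features):
    """Apply a linear model (bias plus per-feature weights) to one user's features.

    Features that the model never saw during learning contribute nothing.
    """
    return model["bias"] + sum(
        model["weights"].get(name, 0.0) * value for name, value in features.items()
    )

# Hypothetical learned model and features of an estimation target user.
model = {"bias": 8.0, "weights": {"hour_stddev": -1.5, "hour=21": 0.5}}
target_features = {"hour_stddev": 2.0, "hour=21": 3.0, "ngram=残業": 4.0}
estimate = estimate_free_time(model, target_features)
```

For these placeholder values the estimate is 8.0 - 3.0 + 1.5 = 6.5; the unseen n-gram feature is ignored.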

As described above, according to the free time estimation device 10 of the present embodiment, using features based on the distribution of the number of document posts makes it possible to appropriately estimate a user's free time from documents posted on the Internet.

なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

例えば、上記実施の形態では、学習部と推定部とを１つのコンピュータで構成する場合について説明したが、別々のコンピュータで構成するようにしてもよい。 For example, although the case where the learning unit and the estimation unit are configured by one computer has been described in the above embodiment, the learning unit and the estimation unit may be configured by separate computers.

In the above embodiment, the case of estimating the free time per day was described as an example. However, the free time over a longer fixed period, for example one week or one month, may be estimated, or the free time over a shorter fixed period, for example from 8:00 to 22:00, may be estimated.

In the above embodiment, the case was described in which the average number of tweets for each time slot and the standard deviation of the per-time-slot number of tweets in one day are used as the features indicating the distribution of the number of tweets, but the features are not limited to these. For example, the frequency of tweets for each time slot, the cumulative frequency, or the variance or deviation of the number of tweets in one day may be used.
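Some of these alternative distribution features can be sketched as follows. The source does not define them precisely, so the interpretations here are assumptions: "frequency" is taken as each slot's share of the daily total, "cumulative frequency" as the running sum of those shares, and the variance as the population variance. The hourly counts are made up.

```python
from itertools import accumulate
from statistics import pvariance

def relative_frequency(hourly_counts):
    """Share of the day's posts that falls in each time slot."""
    total = sum(hourly_counts)
    return [c / total if total else 0.0 for c in hourly_counts]

def cumulative_frequency(hourly_counts):
    """Running total of the relative frequencies from slot 0 onward."""
    return list(accumulate(relative_frequency(hourly_counts)))

# Made-up hourly counts: a few morning posts and a burst in the evening.
counts = [0] * 8 + [2, 2] + [0] * 10 + [4, 8, 4, 0]
rel = relative_frequency(counts)
cum = cumulative_frequency(counts)
var = pvariance(counts)
```

Relative frequencies remove the effect of how much a user posts overall, so two users with the same daily rhythm but different volumes get the same profile.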

Although the free time estimation device described above contains a computer system, the term “computer system” also includes a homepage providing environment (or display environment) when a WWW system is used.

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能である。また、本発明は、周知のコンピュータに媒体もしくは通信回線を介して、プログラムをインストールすることによっても実現可能である。 In the present specification, the embodiment has been described in which the program is installed in advance. However, the program can be provided by being stored in a computer-readable recording medium. The present invention can also be realized by installing a program on a known computer via a medium or a communication line.

DESCRIPTION OF SYMBOLS
10 Free time estimation device
11 Learning unit
12 Estimation unit
13 Preprocessing unit
14 Feature extraction unit
15 Estimation model learning unit
16 Free time estimation unit
20 Free time estimation model
141 Word n-gram extraction unit
142 Average value calculation unit
143 Per-time-slot count extraction unit
144 Standard deviation calculation unit

Claims

A free time estimation device comprising:
feature extraction means for extracting, based on a document set containing a plurality of document data posted by the same user, each with data indicating its posting date and time attached, a first feature based on the appearance frequency of words contained in the document set and a second feature indicating the distribution of the number of posts of document data for each of the divided periods obtained by dividing a certain period into a plurality of periods; and
estimation means for estimating the free time in the certain period of an estimation target user whose free time in the certain period is unknown, based on an estimation model that has learned the association between the first feature and the second feature extracted from each of a plurality of learning document sets, each assigned a correct label of a user's free time in the certain period, and the correct label assigned to each of the learning document sets, and on the first feature and the second feature extracted by the feature extraction means from the document set of the estimation target user.

The free time estimation apparatus according to claim 1, wherein the feature extraction unit extracts, as the second feature, an average of the number of posts for each of the divided periods and a variation in the number of posts for each of the divided periods in the certain period.

The free time estimation apparatus according to claim 1, further comprising learning means for learning the estimation model using the plurality of learning document sets.

A free time estimation method in which:
feature extraction means extracts, based on a document set containing a plurality of document data posted by the same user, each with data indicating its posting date and time attached, a first feature based on the appearance frequency of words contained in the document set and a second feature indicating the distribution of the number of posts of document data for each of the divided periods obtained by dividing a certain period into a plurality of periods; and
estimation means estimates the free time in the certain period of an estimation target user whose free time in the certain period is unknown, based on an estimation model that has learned the association between the first feature and the second feature extracted from each of a plurality of learning document sets, each assigned a correct label of a user's free time in the certain period, and the correct label assigned to each of the learning document sets, and on the first feature and the second feature extracted by the feature extraction means from the document set of the estimation target user.

A free time estimation program for causing a computer to function as each of the means constituting the free time estimation device according to any one of claims 1 to 3.