JP2008123062A - Device, method, and program for classifying content

Info

Publication number: JP2008123062A
Application number: JP2006303397A
Authority: JP
Inventors: Kaori Tanio; 香里谷尾; Takeshi Masuyama; 毅司増山
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2006-11-08
Filing date: 2006-11-08
Publication date: 2008-05-29
Anticipated expiration: 2026-11-08
Also published as: JP5013821B2

Abstract

PROBLEM TO BE SOLVED: To provide a device that classifies Web pages such as blogs by region based on the content of the Web page.

SOLUTION: A device 10 for classifying content such as Web pages comprises: means 110 for performing morphological analysis on training data 30 that includes descriptions related to a residence area in the content; means 120 for extracting predetermined morphemes from the result of the morphological analysis; means 130 for calculating the average mutual information between the extracted morphemes and residence categories; means 107 for storing data that associates the residence categories, the extracted morphemes, and the average mutual information between them; and means 140 for classifying input content into a residence category based on the data stored in the means 107.

COPYRIGHT: (C) 2008, JPO & INPIT

Description

The present invention relates to an apparatus, a method, and a program for classifying content.

Diary-style Web pages called blogs, which are updated daily, are well known. A blog provides features such as a "trackback function": when an author uses a comment on someone else's blog entry as material for their own diary, the function notifies the other party that the article has been quoted and automatically sends the author's comment to them. Blogs are therefore known as Web pages on which a broad exchange of opinions can be expected.

In recent years, the number of Web pages offering blogs has increased, and many blogs with different tastes now exist. Businesses that provide blog services therefore want to be able to classify blogs according to predetermined criteria. For example, blogs can be classified by region.

Meanwhile, a method for classifying Web pages by their characteristics is known (see, for example, Patent Document 1). According to Patent Document 1, the role and character of a Web page on the network can be estimated and classified based on statistics of the actions that its creator and its viewers perform on the page.

Patent Document 1: JP 2006-163997 A

However, the statistical information used in Patent Document 1 is limited to basic browsing statistics for a Web page, such as the number of accesses and the number of comments, so a classification suited to the content of the Web page is not always possible. That is, it remains difficult to classify blogs by region from their content, as described above.

An object of the present invention is to provide a method, an apparatus, and a program that classify a Web page such as a blog by region from the content of that page and thereby infer the residence area of the content's creator.

(1) An apparatus for inferring the residence area of a content creator, comprising:
means for performing morphological analysis on training data that, among the content, includes a description related to a residence area;
means for extracting predetermined morphemes from the result of the morphological analysis;
means for calculating the average mutual information between the extracted morphemes and a residence category;
means for storing data in which the residence category, the extracted morphemes, and the average mutual information between the residence category and the morphemes are associated with one another; and
means for classifying predetermined input content into the residence category based on the data stored in the storing means.

The apparatus of (1) performs morphological analysis on training data that includes descriptions related to a residence area, extracts predetermined morphemes from the result, calculates the average mutual information between the extracted morphemes and a residence category, stores data associating the residence category, the extracted morphemes, and their average mutual information, and classifies predetermined input content into the residence category based on the stored data.

Input content can therefore be classified by residence category based on the average mutual information that was derived from the training data and stored. For example, by classifying a Web page such as a blog by region from its content, the residence area of the content's creator can be inferred.

(2) The apparatus for classifying content according to (1), wherein the means for calculating the average mutual information calculates it by the following equation, where P denotes probability:

[equation not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a category c.

(3) The apparatus for classifying content according to (1) or (2), wherein the means for calculating the average mutual information calculates it by the following equation:

[equation not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a category c
e: the number of items that contain the word of category c and contain the morpheme word
f: the number of items that contain the word of category c and do not contain the morpheme word
g: the number of items that do not contain the word of category c and contain the morpheme word
h: the number of items that do not contain the word of category c and do not contain the morpheme word
N = e + f + g + h

(4) A method by which an apparatus classifies content, comprising the steps of:
performing morphological analysis on training data that, among the content, includes a description related to a residence area;
extracting predetermined morphemes from the result of the morphological analysis;
calculating the average mutual information between the extracted morphemes and a residence category;
storing data in which the residence category, the extracted morphemes, and the average mutual information between the residence category and the morphemes are associated with one another; and
classifying predetermined input content into the residence category based on the data stored in the storing step.

The method of (4) performs morphological analysis on training data that includes descriptions related to a residence area, extracts predetermined morphemes from the result, calculates the average mutual information between the extracted morphemes and a residence category, stores data associating the residence category, the extracted morphemes, and their average mutual information, and classifies predetermined input content into the residence category based on the stored data.

Input content can therefore be classified by residence category based on the average mutual information that was derived from the training data and stored. For example, a Web page such as a blog can be classified by region from its content.

(5) The method for classifying content according to (4), wherein the step of calculating the average mutual information calculates it by the following equation, where P denotes probability:

[equation not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a category c.

(6) The method for classifying content according to (4) or (5), wherein the step of calculating the average mutual information calculates it by the following equation:

[equation not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a category c
e: the number of items that contain the word of category c and contain the morpheme word
f: the number of items that contain the word of category c and do not contain the morpheme word
g: the number of items that do not contain the word of category c and contain the morpheme word
h: the number of items that do not contain the word of category c and do not contain the morpheme word
N = e + f + g + h

(7) A program for causing an apparatus that classifies content to execute the steps of:
performing morphological analysis on training data that, among the content, includes a description related to a residence area;
extracting predetermined morphemes from the result of the morphological analysis;
calculating the average mutual information between the extracted morphemes and a residence category;
storing data in which the residence category, the extracted morphemes, and the average mutual information between the residence category and the morphemes are associated with one another; and
classifying predetermined input content into the residence category based on the data stored in the storing step.

The program of (7) performs morphological analysis on training data that includes descriptions related to a residence area, extracts predetermined morphemes from the result, calculates the average mutual information between the extracted morphemes and a residence category, stores data associating the residence category, the extracted morphemes, and their average mutual information, and classifies predetermined input content into the residence category based on the stored data.

Input content can therefore be classified by residence category based on the average mutual information that was derived from the training data and stored. For example, by classifying a Web page such as a blog by region from its content, the residence area of the content's creator can be inferred.

(8) An apparatus for classifying Web pages related to blogs, comprising:
means for performing morphological analysis on training data that, among the Web pages, includes a description related to a residence area;
means for extracting predetermined morphemes from the result of the morphological analysis;
means for calculating the average mutual information between the extracted morphemes and a residence category;
means for storing data in which the residence category, the extracted morphemes, and the average mutual information between the residence category and the morphemes are associated with one another; and
means for classifying a predetermined input Web page into the residence category based on the data stored in the storing means.

The apparatus of (8) performs morphological analysis on training data that, among the Web pages, includes descriptions related to a residence area, extracts predetermined morphemes from the result, calculates the average mutual information between the extracted morphemes and a residence category, stores data associating the residence category, the extracted morphemes, and their average mutual information, and classifies a predetermined input Web page into the residence category based on the stored data.

An input Web page can therefore be classified by residence category based on the average mutual information that was derived from the training data and stored. For example, by classifying a Web page such as a blog by region from its content, the residence area of the content's creator can be inferred.

According to the present invention, input content can be classified by residence category based on the average mutual information that was derived from the training data and stored. For example, a Web page such as a blog can be classified by region from its content.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows the functional blocks and processing flow of an apparatus 10 according to a preferred embodiment of the present invention. The apparatus 10 includes at least a control unit 101 that controls data and a data storage unit 107 that stores data. The apparatus 10 also includes a communication unit and an input unit, described later, through which data related to content, such as a Web page 40 and training data 30, is input.

The control unit 101 includes, as a learning unit 105, a morphological analysis unit 110 that performs morphological analysis on the input training data 30, a morpheme extraction unit 120 that extracts morphemes from the result of the morphological analysis, and an average mutual information calculation unit 130 that calculates the average mutual information between the extracted morphemes and a residence category. The control unit 101 further includes a classification unit 140 that classifies the input Web page 40 into a residence category based on the stored data.

The training data 30 and the Web page 40 may be blog-related content or data, and are pages or data that contain at least one piece of information about a residence area such as a region. In particular, the training data 30 may be those Web pages used to make the apparatus 10 store the data needed for classifying residence areas.

Next, the processing performed by these components is described. As noted above, the input training data 30 contains text that includes character data about a residence area such as a region. The morphological analysis unit 110 performs morphological analysis on this text, splits it (character data composed of multiple words) into individual words, and classifies each word by part of speech (step S01).

As preprocessing before the morphological analysis, the morphological analysis unit 110 may perform stop word processing (removing Web pages that contain words such as "sightseeing", "travel", or "stay") so that only carefully selected Web pages are processed as training data.
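The stop word preprocessing described above can be sketched as a simple page filter. This is a minimal illustration, not the patent's implementation: the function name `filter_training_pages`, the stop word tuple, and the use of plain substring matching on the raw page text are all assumptions.

```python
from typing import Iterable, List, Sequence

# Hypothetical stop words; the text names sightseeing, travel and stay as examples.
STOP_WORDS = ("観光", "旅", "滞在")


def filter_training_pages(pages: Iterable[str],
                          stop_words: Sequence[str] = STOP_WORDS) -> List[str]:
    """Keep only pages whose text contains none of the stop words.

    Substring matching is an assumption made for this sketch; it runs
    before any morphological analysis, as the preprocessing step does.
    """
    return [page for page in pages
            if not any(word in page for word in stop_words)]


pages = ["富山に住んでいます", "富山へ観光に行きました"]
kept = filter_training_pages(pages)  # the sightseeing page is dropped
```

Dropping whole pages rather than individual words reflects the intent stated in the text: a page about a trip to a region is weak evidence of where its author lives.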

Next, the morpheme extraction unit 120 filters the words produced by the morphological analysis unit 110 by part of speech (step S02). Specifically, the morpheme extraction unit 120 extracts only nouns.
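Steps S01 and S02 can be sketched as follows, assuming the morphological analyzer has already produced (surface form, part of speech) pairs. The `Token` type, the `extract_nouns` function, and the part-of-speech label "名詞" (noun) are hypothetical names chosen for illustration; the text does not name a specific analyzer or tag set.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str  # the word as it appears in the text
    pos: str      # part-of-speech label assigned by the analyzer


def extract_nouns(tokens: List[Token], noun_label: str = "名詞") -> List[str]:
    """Step S02: keep only the surface forms tagged as nouns."""
    return [t.surface for t in tokens if t.pos == noun_label]


# Hand-written stand-in for the output of step S01 (morphological analysis).
tokens = [
    Token("八尾", "名詞"),
    Token("で", "助詞"),
    Token("おわら", "名詞"),
    Token("を", "助詞"),
    Token("見る", "動詞"),
]
nouns = extract_nouns(tokens)
```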

Next, the average mutual information calculation unit 130 calculates the average mutual information between each morpheme extracted by the morpheme extraction unit 120 and a residence category (step S03).

A residence category is a word typically used to indicate a residence area, and may be, for example, the name of a prefecture (such as Toyama or Kanagawa).

The average mutual information is calculated using, for example, the following equation.

[equation not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a residence category c.
Here P denotes probability: P(X, Y) is the probability that "X" and "Y" co-occur, and P(X) and P(Y) are the probabilities that each occurs individually.
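For reference, the standard textbook definition of the average mutual information between two binary events (word w present or absent, category c present or absent) is given below. It uses the same quantities the text defines, P(X, Y), P(X), and P(Y), but whether the patent's own equation sums over all four combinations or uses only the co-occurrence term cannot be confirmed from this text, so this is an assumption rather than a reproduction of the original equation.

```latex
\[
MI(w, c) \;=\; \sum_{x \in \{w,\,\bar{w}\}} \; \sum_{y \in \{c,\,\bar{c}\}}
P(x, y)\,\log \frac{P(x, y)}{P(x)\,P(y)}
\]
```

Here \(\bar{w}\) and \(\bar{c}\) denote the absence of the word and of the category, respectively.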

A characteristic of the average mutual information MI is that its value becomes larger the more frequently a word (w) appears on Web pages with a bias toward a specific residence category (c).

This equation can be expressed, for example, as follows (Equation 2).

[Equation 2 not reproduced in this text]

MI(w, c): the average mutual information between a word w, which is a morpheme, and a residence category c
e: the number of Web pages that contain the word of residence category c and contain the morpheme word
f: the number of Web pages that contain the word of residence category c and do not contain the morpheme word
g: the number of Web pages that do not contain the word of residence category c and contain the morpheme word
h: the number of Web pages that do not contain the word of residence category c and do not contain the morpheme word
N = e + f + g + h

Table 1 shows these relationships.
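The page counts e, f, g, and h form a 2x2 contingency table, from which mutual information can be computed directly. The sketch below uses the standard four-term definition with natural logarithms; because Equation 2 itself is not reproduced in this text, treat the exact form (number of terms, log base) as an assumption. The function name `average_mutual_information` is hypothetical.

```python
import math


def average_mutual_information(e: int, f: int, g: int, h: int) -> float:
    """Mutual information (in nats) of a 2x2 table of page counts.

    e: category word present, morpheme word present
    f: category word present, morpheme word absent
    g: category word absent,  morpheme word present
    h: category word absent,  morpheme word absent
    """
    n = e + f + g + h
    if n == 0:
        raise ValueError("at least one page is required")
    # (joint count, category-side total, word-side total) for each cell
    cells = [
        (e, e + f, e + g),
        (f, e + f, f + h),
        (g, g + h, e + g),
        (h, g + h, f + h),
    ]
    mi = 0.0
    for joint, cat_total, word_total in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 := 0)
            mi += (joint / n) * math.log(joint * n / (cat_total * word_total))
    return mi
```

With this definition a word that is spread evenly across categories scores 0, and the score grows as the word concentrates in one category, which matches the behavior the text attributes to MI.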

Using Table 2, MI is calculated for the case where, for example, the residence category is "Toyama" and the word "Yatsuo" (八尾) appears 100 times in Web pages. Here, e is the number of Web pages in the "Toyama" residence category in which "Yatsuo" appears, f is the number of Web pages in the "Toyama" residence category in which "Yatsuo" does not appear, g is the number of Web pages outside the "Toyama" residence category in which "Yatsuo" appears, and h is the number of Web pages outside the "Toyama" residence category in which "Yatsuo" does not appear.

Substituting these values into Equation 2 gives the MI for this pair (the table values and the resulting figure are not reproduced in this text).

As another example, using Table 3, MI is calculated for the case where the residence category is "Toyama" and the word "Fuchu-machi" (婦中町) appears 20 times in Web pages.

Substituting these values into Equation 2 gives the MI for this pair (again, the values are not reproduced in this text).

In this way, whether the residence category "Toyama" is related to a given word can be judged by comparing MI values.

For a single residence category, the control unit 101 has the average mutual information calculation unit 130 calculate the average mutual information for multiple words. As shown in Equation 5, the average mutual information of each word (Yatsuo, Fuchu-machi, Toyama City, and so on) is associated with the residence category (Toyama) in a table or similar structure and stored in the data storage unit 107.
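One way to picture the stored association is a nested mapping from residence category to word to MI value. This is only a sketch of a possible layout: the `MiTable` alias and `store_mi` helper are hypothetical, and apart from 0.04 for "おわら" (Owara), which the text quotes in its classification example, the numeric values are made up for illustration.

```python
from typing import Dict

# residence category -> word -> average mutual information
MiTable = Dict[str, Dict[str, float]]


def store_mi(table: MiTable, category: str, word: str, mi: float) -> None:
    """Associate a word and its MI value with a residence category."""
    table.setdefault(category, {})[word] = mi


table: MiTable = {}
store_mi(table, "富山", "おわら", 0.04)   # value quoted in the text's example
store_mi(table, "富山", "八尾", 0.05)     # hypothetical value
store_mi(table, "富山", "婦中町", 0.03)   # hypothetical value
```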

Next, the classification unit 140, which classifies a given Web page 40 into one of the residence categories, is described. The classification unit 140 accepts the input of a Web page 40 such as a blog (step S04), refers to the data storage unit 107 (steps S05 and S06), and outputs the residence category of the Web page 40.

For example, the classification unit 140 performs morphological analysis on the character data in the Web page 40, splits the text (character data composed of multiple words) into words, classifies the words by part of speech, and extracts only the nouns (for example, "Owara" (おわら)). The classification unit 140 then determines whether any word stored in the data storage unit 107 matches the extracted noun ("Owara"). If there is a match, it looks up the average mutual information (0.04) between the matched word and its associated residence category (Toyama); if that value is at or above a predetermined threshold (for example, 0.035), it assigns that residence category (Toyama) to the Web page 40.
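The threshold rule in this example can be sketched as below, using the two figures from the text (MI of 0.04 for "おわら" and a threshold of 0.035). The function name `classify_by_threshold` is hypothetical, and returning the first category that clears the threshold is an assumption: the text does not say how ties between categories are resolved in this variant.

```python
from typing import Dict, Iterable, Optional


def classify_by_threshold(nouns: Iterable[str],
                          table: Dict[str, Dict[str, float]],
                          threshold: float = 0.035) -> Optional[str]:
    """Return the first category whose stored MI for a page noun meets the threshold."""
    for noun in nouns:
        for category, words in table.items():
            mi = words.get(noun)
            if mi is not None and mi >= threshold:
                return category
    return None  # no stored word matched strongly enough


table = {"富山": {"おわら": 0.04}}
result = classify_by_threshold(["祭り", "おわら"], table)
```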

In another aspect, the classification unit 140 extracts multiple nouns ("Fuchu-machi" (婦中町), "Owara" (おわら)) from a single Web page and determines whether one or more of them match words stored in the data storage unit 107. If so, it compares, word by word, the average mutual information between each matched word and its associated residence category (Toyama). The classification unit 140 may then determine the residence category of the Web page 40 by comprehensively comparing the average mutual information of all matched words. This is described later with reference to FIG. 4.
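The text does not specify how the matched words are "comprehensively" compared, and FIG. 4 is not part of this section. One plausible reading, shown here purely as an assumption, is to sum the MI values of the matched words per category and pick the category with the largest total. The function name `classify_by_total_mi` and all table values except 0.04 for "おわら" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional


def classify_by_total_mi(nouns: Iterable[str],
                         table: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the category with the largest summed MI over the page's distinct nouns."""
    totals: Dict[str, float] = defaultdict(float)
    for noun in set(nouns):  # count each distinct noun once
        for category, words in table.items():
            if noun in words:
                totals[category] += words[noun]
    if not totals:
        return None
    return max(totals, key=totals.get)


table = {
    "富山": {"おわら": 0.04, "婦中町": 0.03},   # 0.03 is a made-up value
    "神奈川": {"横浜": 0.06},                   # made-up category entry
}
best = classify_by_total_mi(["婦中町", "おわら", "横浜"], table)
```

Other aggregation rules (maximum single MI, weighted votes) would fit the same description equally well; summation is chosen only because it is the simplest.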

[Hardware Configuration of Apparatus 10]
FIG. 2 shows an example of the hardware configuration of the apparatus 10 according to the preferred embodiment described with reference to FIG. 1. The apparatus 10 may be a computer that includes a CPU (Central Processing Unit) 1010 constituting the control unit 101 (in a multiprocessor configuration, additional CPUs such as a CPU 1012 may be added), a bus line 1005, a communication I/F 1040, a main memory 1050, a BIOS (Basic Input Output System) 1060, a USB port 1090, an I/O controller 1070, input means such as a keyboard and mouse 1100, and a display device 1022.

Storage means such as a tape drive 1072, a hard disk 1074, an optical disk drive 1076, and a semiconductor memory 1078 can be connected to the I/O controller 1070.

The BIOS 1060 stores a boot program that the CPU 1010 executes when the apparatus 10 starts up, programs that depend on the hardware of the apparatus 10, and the like.

The hard disk 1074, which constitutes the data storage unit 107, stores the various programs that allow the apparatus 10 to function as a server and the programs that execute the functions of the present invention, and can also hold various databases as needed.

As the optical disk drive 1076, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive can be used. In that case, an optical disk 1077 compatible with the drive is used. A program or data can be read from the optical disk 1077 by the optical disk drive 1076 and supplied to the main memory 1050 or the hard disk 1074 via the I/O controller 1070. Similarly, tape media 1071 compatible with the tape drive 1072 can be used, mainly for backup.

The program supplied to the apparatus 10 is provided stored on a recording medium such as the hard disk 1074, the optical disk 1077, or a memory card. The program may be installed on the apparatus 10 and executed after being read from the recording medium via the I/O controller 1070 or downloaded via the communication I/F 1040.

The program may be stored on an internal or external storage medium. In addition to the hard disk 1074, the optical disk 1077, or a memory card, the storage medium constituting the data storage unit 107 may be a magneto-optical recording medium such as an MD, or a tape medium. A storage device such as a hard disk 1074 or an optical disk library provided in a server system connected to a dedicated communication line or the Internet may also be used as the recording medium, with the program supplied to the apparatus 10 over the communication line.

The display device 1022 displays screens that accept data input from the user and screens that show the results of processing by the apparatus 10, and includes display devices such as a cathode ray tube (CRT) display or a liquid crystal display (LCD).

The input means accepts input from the user and may consist of the keyboard and mouse 1100 or the like.

The communication I/F 1040 is a network adapter that allows the apparatus 10 to connect to terminals over a dedicated network or a public network. The communication I/F 1040 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.

The description above has focused on the apparatus 10, but the functions described above can also be realized by installing a program on a device and operating that device as a server apparatus. Accordingly, the functions realized by the server described as an embodiment of the present invention can also be realized by having the device execute the method described above, or by installing the program described above on the device and executing it.

図３は、装置１０が、トレーニングデータ３０の入力を受けて、データ記憶部１０７に、居住カテゴリと、抽出した形態素と、平均相互情報量とが関係づけられたデータを記憶する好適な他の実施態様を示す概念図である。 FIG. 3 shows another example in which the apparatus 10 receives the training data 30 and stores data in which the residence category, the extracted morpheme, and the average mutual information amount are associated with each other in the data storage unit 107. It is a conceptual diagram which shows an embodiment.

最初に、定期的にＷｅｂサーバを巡回するクローラ２２０に対して、所定の居住カテゴリ（例えば、「富山」）に関連したＷｅｂページのＵＲＬ（Uniform Resource Locator）リスト２１０が、所定の装置から入力される（ステップＳ１０）。ここで、居住カテゴリとそれに関連したＷｅｂページを抽出する処理は、ユーザが行ってもよい。 First, a URL (Uniform Resource Locator) list 210 of Web pages related to a predetermined residence category (for example, “Toyama”) is input from a predetermined device to the crawler 220 that periodically visits the Web server. (Step S10). Here, the process of extracting the residence category and the Web page related thereto may be performed by the user.

Based on the URL list of Web pages related to this residence category ("Toyama"), the crawler 220 requests each URL from the Web server 230 on the Internet (step S11) and acquires the content 240a of the Web server 230 (step S12). Having acquired the content 240a, the crawler 220 stores it in the case database with residence category 250 (step S13). At this point, the content 240a may be processed (data-converted) into the content 240b. That is, only predetermined, frequently used words may be extracted from the Web page that constitutes the content 240a, and the result used as the content 240b.

For example, if one Web page collected for the residence category "Toyama" contains the word "Fuchu-machi" (婦中町) five times, the word "Toyama City" (富山市) three times, and the word "Yatsuo" (八尾) three times, these counts are stored in the case database with residence category 250 as data associated with the residence category, as shown in FIG. 3.
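Steps S10 to S13 and the word-count example above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `collect_cases`, `fetch`, and `keywords` are hypothetical, the page fetcher is passed in as a function so that no particular HTTP library is assumed, and plain substring counting stands in for the extraction of predetermined words.

```python
from typing import Callable, Dict, Iterable, List


def collect_cases(
    category: str,
    urls: Iterable[str],
    fetch: Callable[[str], str],
    keywords: Iterable[str],
) -> List[dict]:
    """Build labeled case records for one residence category.

    category -- residence category attached to every URL in the list (S10)
    urls     -- URL list for that category
    fetch    -- hypothetical function returning the page text for a URL (S11, S12)
    keywords -- predetermined words to keep when reducing a page to counts
    """
    keyword_list = list(keywords)
    cases = []
    for url in urls:
        page_text = fetch(url)  # content 240a
        counts: Dict[str, int] = {}
        for word in keyword_list:
            occurrences = page_text.count(word)  # assumption: substring counting
            if occurrences > 0:
                counts[word] = occurrences
        # The reduced record plays the role of content 240b, stored together
        # with its residence category (S13).
        cases.append({"category": category, "url": url, "counts": counts})
    return cases
```

In practice `fetch` would wrap an HTTP client, and the returned list would be written to whatever store plays the role of the case database with residence category 250.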

By storing a large amount of such data, the case database with residence category 250 also comes to hold, conversely, the number of Web pages in the residence category "Toyama" that do not contain "Fuchu-machi" (婦中町) and the number of Web pages outside the category "Toyama" that do contain "Fuchu-machi". This makes it possible to prepare all the data needed to calculate the average mutual information amount of, for example, the word "Fuchu-machi" with respect to the residence category "Toyama".
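The page counts mentioned above form a 2x2 contingency table for each (category, word) pair. A minimal sketch of how they could be derived from stored case records follows; the record layout (a dict with `category` and `counts` keys) and the function name `contingency_counts` are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def contingency_counts(
    cases: Iterable[dict], category: str, word: str
) -> Tuple[int, int, int, int]:
    """Count pages by (contains word?, belongs to category?).

    Returns (n_wc, n_w_notc, n_notw_c, n_notw_notc):
      n_wc        -- pages in the category that contain the word
      n_w_notc    -- pages outside the category that contain the word
      n_notw_c    -- pages in the category that do not contain the word
      n_notw_notc -- pages outside the category that do not contain the word
    """
    n_wc = n_w_notc = n_notw_c = n_notw_notc = 0
    for case in cases:
        in_category = case["category"] == category
        has_word = case["counts"].get(word, 0) > 0
        if has_word and in_category:
            n_wc += 1
        elif has_word:
            n_w_notc += 1
        elif in_category:
            n_notw_c += 1
        else:
            n_notw_notc += 1
    return n_wc, n_w_notc, n_notw_c, n_notw_notc
```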

When data with residence categories is input to the apparatus 10 from the case database with residence category 250 (step S14), the learning unit 105 of the apparatus 10 processes it and stores, in the data storage unit 107, data in which the residence category, the extracted morphemes, and the average mutual information amount are associated with one another (step S15).
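The calculation performed in step S15 can be illustrated with the textbook definition of average (expected) mutual information between a word-presence indicator and a category-membership indicator, estimated from page counts. The patent's own formula is not reproduced in this section, so the definition used here, the base-2 logarithm, and the function name `average_mutual_information` are assumptions.

```python
import math


def average_mutual_information(
    n_wc: int, n_w_notc: int, n_notw_c: int, n_notw_notc: int
) -> float:
    """Average mutual information (in bits) between word w and category c.

    Sums P(x, y) * log2(P(x, y) / (P(x) * P(y))) over the four combinations
    of x in {w, not w} and y in {c, not c}, with probabilities estimated
    as relative page frequencies. Empty cells contribute zero.
    """
    total = n_wc + n_w_notc + n_notw_c + n_notw_notc
    if total == 0:
        return 0.0
    n_w = n_wc + n_w_notc          # pages containing the word
    n_notw = n_notw_c + n_notw_notc
    n_c = n_wc + n_notw_c          # pages in the category
    n_notc = n_w_notc + n_notw_notc
    cells = [
        (n_wc, n_w, n_c),
        (n_w_notc, n_w, n_notc),
        (n_notw_c, n_notw, n_c),
        (n_notw_notc, n_notw, n_notc),
    ]
    mi = 0.0
    for joint, word_margin, category_margin in cells:
        if joint == 0:
            continue  # limit of p * log(p) as p -> 0 is 0
        mi += (joint / total) * math.log2(
            joint * total / (word_margin * category_margin)
        )
    return mi
```

A word that appears equally often inside and outside a category scores 0, while a word that appears only in pages of that category scores high, which is why such values are useful as per-word evidence for a residence category.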

Here, when predetermined words have already been extracted from the Web pages and data from which the average mutual information amount can be calculated is already stored, as in the case database with residence category 250, the apparatus 10 need not perform the processing of the morpheme analysis unit 110 or the morpheme extraction unit 120.

Finally, as shown in FIG. 3, the data storage unit 107 stores each word in association with its average mutual information amount.

Next, the processing of the classification unit 140 of the apparatus 10 is described with reference to FIG. 4. Suppose that a Web page whose residence category has not yet been determined is input to the apparatus 10 (step S20). For example, consider a Web page in which "movie" (映画) appears three times, "cinema" (シネマ) once, and "Kawasaki" (川崎) twice; the goal is to decide which residence category this Web page should be classified into. In this case, the classification unit 140 of the apparatus 10 calculates the average mutual information amount of each word based on the data stored in the data storage unit 107 (steps S21 and S22), compares the values, and determines the residence category.

For example, the classification unit 140 extracts a plurality of nouns ("movie", "Kawasaki") from the Web page and determines whether one or more of them match words stored in the data storage unit 107. If so, it compares, word by word, the average mutual information amounts between each matched word and the residence categories associated with it (Kanagawa, Yokohama, and so on). Suppose, for instance, that the average mutual information amount between the residence category "Kanagawa" and the word "movie" is 0.01 and that between "Kanagawa" and the word "Kawasaki" is 0.05, while that between the residence category "Yokohama" and the word "movie" is 0.001 and that between "Yokohama" and the word "Kawasaki" is 0.01. The classification unit 140 may then select the residence category "Kanagawa", which has the larger sum of average mutual information amounts (0.06 versus 0.011).
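The worked example above can be expressed as a short sketch. The table layout, the names `MI_TABLE`, `score_categories`, and `classify`, and the choice to count each distinct matched word once (ignoring how often it appears on the page) are assumptions for illustration; the values are the ones given in the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

# word -> {residence category -> average mutual information amount},
# standing in for the contents of the data storage unit 107.
MI_TABLE: Dict[str, Dict[str, float]] = {
    "movie": {"Kanagawa": 0.01, "Yokohama": 0.001},
    "Kawasaki": {"Kanagawa": 0.05, "Yokohama": 0.01},
}


def score_categories(
    nouns: Iterable[str], mi_table: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Sum, per category, the stored values of every matched noun."""
    scores: Dict[str, float] = defaultdict(float)
    for noun in set(nouns):  # assumption: each distinct word counts once
        for category, mi in mi_table.get(noun, {}).items():
            scores[category] += mi
    return dict(scores)


def classify(
    nouns: Iterable[str], mi_table: Dict[str, Dict[str, float]]
) -> Optional[str]:
    """Return the category with the largest sum, or None if nothing matches."""
    scores = score_categories(nouns, mi_table)
    if not scores:
        return None
    return max(scores, key=scores.get)


# Nouns from the example page; "cinema" has no stored entry and is ignored.
print(classify(["movie", "cinema", "Kawasaki"], MI_TABLE))
```

Returning `None` when no extracted noun matches a stored word mirrors the condition in the text that classification proceeds only when one or more words match.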

Embodiments of the present invention have been described above, but they merely illustrate specific examples and do not limit the present invention. The effects described in the embodiments are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

FIG. 1 is a diagram showing the functional blocks and processing flow of the apparatus 10 according to a preferred embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration of the apparatus 10 according to a preferred embodiment of the present invention.
FIG. 3 is a diagram explaining the processing of the apparatus 10 according to an example of another preferred embodiment of the present invention.
FIG. 4 is a diagram explaining the processing of the apparatus 10 according to an example of another preferred embodiment of the present invention.

Explanation of symbols

10 Apparatus
30 Training data
40 Web page
101 Control unit
105 Learning unit
107 Data storage unit
110 Morpheme analysis unit
120 Morpheme extraction unit
130 Average mutual information amount calculation unit
140 Classification unit
210 URL list
220 Crawler
240a, 240b Content
250 Case database with residence category
1005 Bus line
1010, 1012 CPU
1022 Display device
1050 Main memory
1070 Controller
1071 Tape medium
1072 Tape drive
1074 Hard disk
1076 Optical disk drive
1077 Optical disk
1078 Semiconductor memory
1090 USB port
1100 Keyboard and mouse

Claims

A device that infers the residence area of the content creator,
Among the contents, means for morphological analysis of training data including a description related to a residential area;
Means for extracting a predetermined morpheme from the result of the morpheme analysis;
Means for calculating an average mutual information amount between the extracted morpheme and the residence category;
Means for storing data in which the residence category, the extracted morpheme, and the average mutual information amount of the residence category and the morpheme are related;
Means for classifying the inputted predetermined content into the residence category based on the data stored in the means for storing;
A device comprising:

An apparatus for classifying content according to claim 1,
The means for calculating the average mutual information amount has P as a probability,

MI (w, c): An apparatus that calculates an average mutual information amount from an average mutual information amount between a word w as a morpheme and a category c.

An apparatus for classifying the content according to claim 1 or 2,
The means for calculating the average mutual information amount is:

A method by which a device classifies content, the method comprising the steps of:
morphologically analyzing training data that includes a description related to a residential area in the content;
extracting a predetermined morpheme from the result of the morphological analysis;
calculating an average mutual information amount between the extracted morpheme and a residence category;
storing data in which the residence category, the extracted morpheme, and the average mutual information amount between the residence category and the morpheme are associated with one another; and
classifying input predetermined content into the residence category based on the data stored in the storing step.

The method of classifying content according to claim 4, wherein P is a probability,
In the step of calculating the average mutual information amount,

A method of calculating an average mutual information amount by:

A method for classifying content according to claim 4 or claim 5, comprising:
In the step of calculating the average mutual information amount,

A program for causing a device that classifies content to execute the steps of:
morphologically analyzing training data that includes a description related to a residential area in the content;
extracting a predetermined morpheme from the result of the morphological analysis;
calculating an average mutual information amount between the extracted morpheme and a residence category;
storing data in which the residence category, the extracted morpheme, and the average mutual information amount between the residence category and the morpheme are associated with one another; and
classifying input predetermined content into the residence category based on the data stored in the storing step.

An apparatus for classifying web pages related to a blog,
A means for morphological analysis of training data including a description related to a living area in the web page;
Means for extracting a predetermined morpheme from the result of the morpheme analysis;
Means for calculating an average mutual information amount between the extracted morpheme and the residence category;
Means for storing data in which the residence category, the extracted morpheme, and the average mutual information amount of the residence category and the morpheme are related;
Means for classifying the inputted predetermined web page into the residence category based on the data stored in the means for storing;
A device comprising: