JP5311378B2

JP5311378B2 - Feature word automatic learning system, content-linked advertisement distribution computer system, search-linked advertisement distribution computer system, text classification computer system, and computer programs and methods thereof

Info

Publication number: JP5311378B2
Application number: JP2008167639A
Authority: JP
Inventors: 禎夫黒橋; 知秀柴田
Original assignee: Kyoto University
Current assignee: Kyoto University
Priority date: 2008-06-26
Filing date: 2008-06-26
Publication date: 2013-10-09
Anticipated expiration: 2028-06-26
Also published as: JP2010009307A

Description

The present invention relates to a feature word automatic learning system, a content-linked advertisement distribution computer system, a search-linked advertisement distribution computer system, and a text classification computer system, as well as to computer programs and methods for these systems.

In recent years, CGM (Consumer Generated Media) such as blogs and social networking systems has attracted attention, and word of mouth on the Internet has come to have a major influence on consumers' purchasing behavior.
With the spread of CGM, the market for content-linked advertising, which presents advertisements matched to consumers' interests, continues to grow.

Content-linked advertising services currently in operation include Google AdSense (see Non-Patent Document 1) and MicroAd (see Non-Patent Document 2).
In a conventional content-linked advertisement distribution system, the advertiser first sets keywords corresponding to an advertisement. The system then analyzes the Web content and places the most suitable advertisement based on that analysis.
Non-Patent Document 1: Google AdSense homepage, [online], [retrieved June 23, 2008], Internet <http://www.googole.com/adsense/?hl=ja>
Non-Patent Document 2: MicroAd, Inc. homepage, [online], [retrieved June 23, 2008], Internet <http://www.microad.jp/>

The present inventors conceived the idea of acquiring feature words corresponding to each product category in advance, classifying a website into product categories based on the feature words (keywords) it contains, and displaying advertisements according to those product categories.
In such a system, there is no need to set a wide variety of anticipated keywords for each individual advertisement; it suffices to assign a predetermined product category to each advertisement. Operation of the advertisement distribution system therefore becomes easier.

However, building such a system requires feature words corresponding to each product category to be prepared in advance.
For example, the category "mascara" requires feature words such as "eyeliner", "eyebrow", "eyelash curler", and "eyelash".
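The idea of classifying a page by the feature words it contains can be illustrated with a small sketch. Only the "mascara" entry comes from the text; the second table entry, the function name `match_categories`, and the simple overlap-count scoring are assumptions made for illustration, since the text does not say how matches are scored.

```python
from collections import Counter

# Hypothetical feature word table; only the "mascara" entry comes from the text.
FEATURE_WORDS = {
    "mascara": {"eyeliner", "eyebrow", "eyelash curler", "eyelash"},
    "soy sauce": {"seasoning", "soybean", "sashimi"},
}

def match_categories(keywords, feature_words=FEATURE_WORDS, min_hits=1):
    """Return categories ordered by how many page keywords are feature words."""
    hits = Counter()
    page_words = set(keywords)
    for category, words in feature_words.items():
        n = len(words & page_words)
        if n >= min_hits:
            hits[category] = n
    return [category for category, _ in hits.most_common()]
```

With this table, a page whose keywords include "eyeliner" and "eyelash" would be mapped to "mascara", and an advertisement tagged with that category could then be selected.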

Because there are many categories and new categories keep appearing, maintaining the feature words of each product category by hand is costly, so automatic acquisition is desired.

Accordingly, an object of the present invention is to provide a system for automatically learning feature words corresponding to categories, and a system that performs advertisement distribution and the like using feature words corresponding to categories.

(1) The present invention is a feature word automatic learning system that automatically learns feature words corresponding to each of predetermined categories, comprising: Web text acquisition means for acquiring a plurality of Web texts through a search engine, using a main keyword indicating a category as a query; feature word candidate extraction means for extracting feature word candidates from the plurality of Web texts obtained using the main keyword as a query; relevance calculation means for calculating the degree of relevance between each extracted feature word candidate and the category; and a feature word database that stores, for each category, the feature word candidates whose degree of relevance is higher than a predetermined threshold, in association with the category as feature words corresponding to that category.
According to the present invention, feature words corresponding to a category can be obtained automatically.

In the present invention, the main keyword indicating a category is preferably the name of the category, but is not limited to it. A single category may have one main keyword or several.

(2) Preferably, the relevance calculation means calculates, as the degree of relevance, a degree of co-occurrence indicating how often the main keyword and the feature word candidate appear together in the group of Web texts searched by the search engine.

(3) Preferably, the relevance calculation means calculates the degree of relevance based on the hit count obtained when the search engine is queried with the main keyword and the feature word candidate.

(4) Preferably, the relevance calculation means calculates the pointwise mutual information between the main keyword and the feature word candidate based on a first hit count obtained when the search engine is queried with both the main keyword and the feature word candidate, a second hit count obtained when it is queried with the main keyword alone, and a third hit count obtained when it is queried with the feature word candidate alone.

(5) Preferably, the system further comprises narrowing processing means for narrowing down the plurality of feature word candidates extracted for one category so as to reduce their number, and the relevance calculation means calculates the degree of relevance to the main keyword for the feature word candidates narrowed down by the narrowing processing means. In this case, the degree of relevance to the main keyword need not be calculated for every extracted feature word candidate, which reduces the computational load.

(6) Preferably, the narrowing processing means narrows down the plurality of feature word candidates extracted for one category based on the frequency with which each feature word candidate appears in the plurality of Web texts acquired by the Web text acquisition means.

(7) Preferably, the narrowing processing means narrows down the plurality of feature word candidates extracted for one category based on a first frequency with which the feature word candidate appears in the plurality of Web texts acquired by the Web text acquisition means, and a second frequency with which the feature word candidate appears in the group of Web texts searched by the search engine.

(8) Preferably, the narrowing processing means is configured to calculate a narrowing score for each extracted feature word candidate using a narrowing formula and to retain, as narrowed-down feature words, those candidates whose narrowing score exceeds a predetermined narrowing threshold, and the narrowing formula is configured so that the narrowing score increases as the first frequency increases and decreases as the second frequency increases.

(9) Preferably, the feature word candidate extraction means includes means for excluding from the feature word candidates any high-frequency word whose frequency of appearance in the group of Web texts searched by the search engine is higher than a predetermined frequency threshold. In this case, common words can be excluded from the feature word candidates.

(10) From another aspect, the present invention is a content-linked advertisement distribution computer system that distributes advertisements related to the content of Web content, comprising: a feature word database storing feature words corresponding to each category; an advertisement database storing advertisement data corresponding to each category; keyword extraction means for analyzing the Web content targeted for advertisement distribution and extracting keywords; selection means for referring to the feature word database based on the extracted keywords and selecting one or more categories corresponding to the Web content; and means for referring to the advertisement database based on the selected categories and causing the advertisement data of the selected categories to be displayed together with the Web content targeted for advertisement distribution, wherein the feature word database obtained by the feature word automatic learning system described above is used as the feature word database.

(11) The keyword extraction means (a colloquial content analysis system) may comprise: morphological analysis means for performing morphological analysis on a text to be analyzed, such as a Web text, and extracting morphemes that serve as keywords; detection means for detecting the possibility of a morphological analysis error caused by colloquial text contained in the text; and means for excluding from the keywords any morpheme for which the possibility of a morphological analysis error has been detected. This allows keywords to be extracted appropriately from colloquial text.

(12) Preferably, the detection means detects the possibility of a morphological analysis error based on the morpheme before or after the morpheme serving as the keyword.

(13) Preferably, the detection means detects the possibility of a morphological analysis error for the morpheme that is a keyword candidate when the morpheme before or after it is a single hiragana or single katakana character whose part of speech is determined to be unknown.

(14) Preferably, the detection means detects the possibility of a morphological analysis error for the morpheme serving as the keyword when the morpheme before or after it is a small-form (lowercase) hiragana character or a single small-form katakana character.

(15) From another aspect, the present invention is a search-linked advertisement distribution computer system that distributes advertisements related to a search keyword, comprising: a feature word database storing feature words corresponding to each category; an advertisement database storing advertisement data corresponding to each category; selection means for referring to the feature word database based on the search keyword and selecting one or more product categories corresponding to the search keyword; and means for referring to the advertisement database based on the selected categories and causing the advertisement data of the selected categories to be displayed on a website, wherein the feature word database obtained by the feature word automatic learning system described above is used as the feature word database.

(16) From yet another aspect, the present invention is a text classification computer system for classifying text data, comprising: a feature word database storing feature words corresponding to each of predetermined categories; keyword extraction means for analyzing the text data to be classified and extracting keywords; and selection means for referring to the feature word database based on the extracted keywords and selecting a category corresponding to the text data, wherein the feature word database obtained by the feature word automatic learning system described above is used as the feature word database.

(17) From yet another aspect, the present invention is a computer program for causing a computer to function as the feature word automatic learning system described above.

(18) From yet another aspect, the present invention is a computer program for causing a computer to function as the content-linked advertisement distribution computer system described above.

(19) From yet another aspect, the present invention is a computer program for causing a computer to function as the search-linked advertisement distribution computer system described above.

(20) From yet another aspect, the present invention is a computer program for causing a computer to function as the text classification computer system described above.

(21) From yet another aspect, the present invention is a feature word database automatic generation method in which a computer automatically generates a feature word database storing feature words corresponding to each of predetermined categories, the method comprising the steps of: the computer acquiring a plurality of Web texts through a search engine, using a main keyword indicating a category as a query; the computer extracting feature word candidates from the plurality of Web texts obtained using the main keyword as a query; the computer calculating the degree of relevance between the main keyword and each feature word candidate; and the computer storing in the feature word database, for each category, the feature word candidates whose degree of relevance is higher than a predetermined threshold, in association with the category as feature words corresponding to that category.

(22) From yet another aspect, the present invention is a method of distributing, by computer, advertisements related to the content of Web content, comprising the steps of: the computer analyzing the Web content targeted for advertisement distribution and extracting keywords; the computer referring, based on the extracted keywords, to the feature word database obtained by the feature word database automatic generation method described above, and selecting one or more categories corresponding to the Web content; and the computer referring, based on the selected categories, to an advertisement database storing advertisement data corresponding to each category, and causing the advertisement data of the selected categories to be displayed together with the Web content targeted for advertisement distribution.

(23) From yet another aspect, the present invention is a method of distributing, by computer, advertisements related to a search keyword, comprising the steps of: the computer referring, based on the search keyword, to the feature word database obtained by the feature word database automatic generation method described above, and selecting one or more categories corresponding to the search keyword; and the computer referring, based on the selected product categories, to an advertisement database storing advertisement data corresponding to each category, and causing the advertisement data of the selected categories to be displayed on a website.

(24) From yet another aspect, the present invention is a method of classifying text data by computer, comprising the steps of: analyzing the text data to be classified and extracting keywords; and referring, based on the extracted keywords, to the feature word database obtained by the feature word database automatic generation method described above, and selecting a category corresponding to the text data.

According to the present invention, a feature word database corresponding to categories can be generated automatically.
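Item (4) derives the degree of relevance from three hit counts. A minimal sketch of pointwise mutual information (the standard English term for 自己相互情報量) computed from such counts is given below. The function name, the use of the total document count to turn hit counts into probabilities, and the base-2 logarithm are assumptions of this sketch, not details stated in item (4).

```python
import math

def pmi_from_hits(hits_both, hits_keyword, hits_candidate, total_docs):
    """Pointwise mutual information estimated from document hit counts.

    PMI = log2( P(c, w) / (P(c) * P(w)) ), with each probability
    approximated as hit count / total_docs (an assumption of this sketch).
    """
    if min(hits_both, hits_keyword, hits_candidate) <= 0:
        return float("-inf")  # no co-occurrence evidence
    return math.log2(hits_both * total_docs / (hits_keyword * hits_candidate))
```

A value of 0 means the main keyword and the candidate co-occur exactly as often as independence would predict; larger values indicate a stronger association, which is what the threshold comparison in item (1) relies on.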

Embodiments of the present invention are described below with reference to the drawings. A content-linked advertisement distribution computer system is described as the first embodiment, and a search-linked advertisement distribution computer system as the second embodiment.

[1. Content-linked advertisement distribution computer system]
[1.1 Overall system configuration]
FIGS. 1 to 7 show a content-linked advertisement distribution computer system 1 (hereinafter simply "the present system") according to the first embodiment.
The present system 1 distributes advertisements (advertisement data) over the Internet together with CGM-type Web content such as blogs (weblogs), so that the advertisements are displayed together with the Web content on the screen of a user terminal 2.

The present system 1 includes a blog server 11 that performs processing for providing a blog service, a mapping server 12 that analyzes blogs and maps them to product categories, an advertisement server 13 that manages the advertisement data to be displayed together with blogs, and a feature word learning server 14 that automatically learns the feature words corresponding to each product category.

The blog server 11 includes a blog database 11a for storing blog text data, the advertisement server 13 includes an advertisement database 13a storing advertisement data, and the feature word learning server 14 includes a feature word database 14a.

Each of the above functions of the present system is realized by a computer executing a computer program.
The servers and databases constituting the present system 1 may each be implemented on a separate computer, with the computers connected by a network, or computer programs realizing the functions of several servers and databases may be installed on a single computer.

[1.2 Feature word learning server (feature word automatic learning system)]
FIG. 2 shows the functional blocks of the feature word learning server 14. The present system 1 assigns product categories to each blog article for advertisement distribution. For this purpose, the feature word learning server 14 generates feature words corresponding to each category.

As shown in FIG. 3, the feature word learning server 14 runs a search through the search engine using the category name of a given product category as the query (search keyword), and collects a predetermined number of Web texts from the Internet. The feature word learning server 14 then extracts feature words from the Web texts collected for that category name and associates those feature words with the category.

In this embodiment, the JICFS (JAN Item Code File Service) classification of the Distribution System Development Center (財団法人流通システム開発センター) is used for the product categories. JICFS is a database system for managing product information centrally. The JICFS classification has four levels: major, middle, minor, and fine. For example, the fine category "soy sauce" is classified as (Food) - (Processed food) - (Seasoning) - (Soy sauce). In this embodiment, the fine categories (2161 categories) are adopted as the product categories.

Returning to FIG. 2, the feature word learning server 14 includes a search engine 141, a feature word candidate extraction unit 142 that extracts feature word candidates from the Web texts, a narrowing processing unit 143 that narrows down the extracted feature word candidates, a relevance calculation unit 144 that calculates the degree of relevance (pointwise mutual information) between each narrowed-down feature word candidate and the category name, and a threshold comparison unit 145 that compares the degree of relevance (pointwise mutual information) with a threshold to determine the feature words corresponding to the category.
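The way these five units chain together can be sketched as a single function that takes each stage as a callable. The function name and the callable signatures are hypothetical; they only mirror the order of the units 141 to 145 described above and say nothing about how the actual servers are implemented.

```python
def learn_feature_words(category, search, extract, narrow, relevance, threshold):
    """Chain the five stages for one category name.

    search(category)        -> list of Web texts        (search engine 141)
    extract(texts)          -> list of candidate words  (extraction unit 142)
    narrow(cands, texts)    -> shorter candidate list   (narrowing unit 143)
    relevance(category, w)  -> float score              (relevance unit 144)
    Candidates scoring above `threshold` are kept       (comparison unit 145)
    """
    texts = search(category)
    candidates = narrow(extract(texts), texts)
    return [w for w in candidates if relevance(category, w) > threshold]
```

Passing the stages in as arguments keeps the sketch independent of any particular search engine or morphological analyzer; each stage can be replaced by a stub when experimenting.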

Using each category name C of the product categories (JICFS classification) stored in advance in the present system as a query (search keyword), the search engine 141 acquires a predetermined number (up to 1000) of Web texts from Web servers 3 on the Internet or from the Web texts held by the search engine itself.
Because commercial search engine APIs cannot provide a sufficient amount of text, this embodiment adopts the search engine infrastructure TSUBAKI (http://tsubaki.ixnlp.nii.ac.jp/index.cgi), developed by the present inventors, as the search engine 141 of the present system 1.

The feature word candidate extraction unit 142 extracts feature word candidates from the Web texts acquired by the search engine 141. It includes a morphological analyzer 142a that performs morphological analysis on the Web texts and extracts certain words, and an exclusion unit 142b that excludes, from the words extracted by the morphological analyzer, those that are clearly unsuitable as feature word candidates.

The morphological analyzer 142a performs morphological analysis on the acquired Web texts (up to 1000), decomposing them into morphemes (basic vocabulary units) and determining the part of speech of each morpheme. Morphemes whose part of speech cannot be determined are treated as undefined words.
In this embodiment, the Japanese morphological analysis system JUMAN (http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html) is adopted as the morphological analyzer 142a.

From the morphemes, the morphological analyzer 142a extracts the words whose part of speech is noun or undefined word, together with the compound nouns formed by sequences of such words (excluding temporal nouns, single hiragana characters, and single katakana characters).
The words extracted by the morphological analyzer 142a are handled in the representative notation output by JUMAN. As a result, "喉", "のど", and "ノド" (all meaning "throat"), for example, are handled under the same representative notation "喉", which eliminates orthographic variation.
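The extraction rule can be sketched on already-analyzed input. The sketch below does not call JUMAN; it assumes each morpheme arrives as a hypothetical `(representative form, part of speech, subcategory)` tuple with English stand-in labels (`"noun"`, `"undefined"`, `"temporal"`). Treating an excluded morpheme as a boundary that ends a compound is also an assumption, since the text does not say how exclusions interact with compounds.

```python
import re

# Exactly one hiragana or one katakana character.
SINGLE_KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")

def is_candidate(rep, pos, subpos):
    """Noun or undefined word, excluding temporal nouns and single kana."""
    if pos not in ("noun", "undefined"):
        return False
    return subpos != "temporal" and not SINGLE_KANA.fullmatch(rep)

def extract_terms(morphemes):
    """Collect single candidate words and maximal runs joined as compounds."""
    terms, run = set(), []
    # The trailing sentinel is never a candidate, so it flushes the last run.
    for rep, pos, subpos in list(morphemes) + [("", "end", "")]:
        if is_candidate(rep, pos, subpos):
            run.append(rep)
            terms.add(rep)
            continue
        if len(run) > 1:
            terms.add("".join(run))
        run = []
    return terms
```

Because the input uses representative forms, variants such as "のど" and "ノド" would already have been normalized to "喉" before this step.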

The many words (nouns, undefined words, and compound nouns) extracted by the morphological analyzer 142a may include common words that are unsuitable as feature words (for example, "俺", a first-person pronoun). Such common words appear very frequently in Web documents.
The exclusion unit 142b of the feature word candidate extraction unit 142 therefore excludes, from the words extracted by the morphological analyzer 142a, those that occur with high frequency in Web texts, treating them as unsuitable feature word candidates.

Specifically, the exclusion unit 142b has the search engine 141 run a search using each of the words (morphemes) extracted by the morphological analyzer 142a as the search keyword (query). Then, as shown in FIG. 4, it obtains for each extracted word the hit count reported by the search engine 141 (the number of documents hit, that is, the frequency of appearance in Web texts). Words whose hit count is at or above the frequency threshold of 2,000,000 are discarded as high-frequency words.
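This exclusion step amounts to a threshold filter on hit counts. In the sketch below, the threshold value comes from the text, while the function name and the `hit_count` callable (a stand-in for a query to the search engine) are assumptions.

```python
HIGH_FREQUENCY_THRESHOLD = 2_000_000  # frequency threshold stated in the text

def drop_high_frequency(words, hit_count, threshold=HIGH_FREQUENCY_THRESHOLD):
    """Keep only words whose hit count is below the threshold.

    `hit_count` is a caller-supplied function mapping a word to the number
    of matching documents; it stands in for a search engine query.
    """
    return [w for w in words if hit_count(w) < threshold]
```

A word whose hit count equals the threshold is dropped, matching the "at or above" wording of the rule.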

Through the above processing, the feature word candidate extraction unit 142 outputs, as the feature word candidates w for category name C, the nouns, undefined words, and compound nouns contained in the predetermined number of Web texts obtained with category name C, excluding high-frequency words.

The narrowing processing unit 143 narrows down the many feature word candidates w output from the feature word candidate extraction unit 142, reducing the number of feature word candidates w extracted for one category. The relevance calculation unit 144 calculates the degree of relevance to the category for each narrowed-down feature word candidate w.
The relevance calculation unit 144 could calculate the degree of relevance to category C for all the feature word candidates w extracted by the feature word candidate extraction unit 142. However, calculating the degree of relevance for a large number of feature word candidates across all categories C is computationally expensive, so it is preferable to reduce the number of feature word candidates w by the narrowing process first, as described above, and then calculate the degree of relevance.

The narrowing processing unit 143 of this embodiment includes an LDF·IGDF calculation unit 143a, which calculates the LDF·IGDF value used as the narrowing score. The narrowing processing unit 143 keeps only the top L (= 50) feature word candidates w with the largest LDF·IGDF values.
The LDF·IGDF calculation unit 143a calculates the LDF·IGDF value, i.e. the narrowing score, by equation (1).

Here, LDF(w) is the number of documents (texts) in which the feature word candidate w appears among the top 1000 Web texts that the search engine 141 returned for category C (the first frequency).
GDF(w) is the number of documents (texts) in which the feature word candidate w appears among all Web texts searched by the search engine 141 (the Web text group) (the second frequency).
N is the total number of Web texts searched by the search engine 141. In this embodiment, N = 100,000,000.
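The image of equation (1) is not reproduced in this text. By analogy with TF-IDF, and consistent with the definitions of LDF(w), GDF(w), and N given here, one plausible form is the following. This is an assumption, not a quotation of the original formula, and the base of the logarithm is likewise unspecified.

```latex
\mathrm{LDF{\cdot}IGDF}(w) \;=\; \mathrm{LDF}(w)\cdot\log\frac{N}{\mathrm{GDF}(w)}
```

Under this form the score grows with the first frequency LDF(w) and shrinks as the second frequency GDF(w) grows, which matches the behaviour required of the narrowing calculation formula in the claims.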

As equation (1) shows, the LDF·IGDF value (the narrowing score) increases as the first frequency LDF(w) increases. This is because a high frequency of appearance in the Web texts returned for category C suggests a strong relationship with category C.
Conversely, the LDF·IGDF value decreases as the second frequency GDF(w) increases. This is because a high frequency of appearance across all Web texts searched by the search engine 141 suggests a weak relationship with any particular category.
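The narrowing step can be sketched as below. The equation itself is not reproduced in this text, so the sketch assumes the TF-IDF-like form LDF(w) · log(N / GDF(w)) with a natural logarithm. The function names and the toy frequencies are hypothetical, and only N = 100,000,000 and L = 50 come from the description.

```python
import math
from typing import Dict, List, Tuple

N_DOCS = 100_000_000  # N in this embodiment
TOP_L = 50            # number of candidates kept per category


def ldf_igdf(ldf: int, gdf: int, n_docs: int = N_DOCS) -> float:
    """Narrowing score under the ASSUMED form LDF(w) * log(N / GDF(w))."""
    return ldf * math.log(n_docs / gdf)


def narrow_candidates(
    stats: Dict[str, Tuple[int, int]], top_l: int = TOP_L
) -> List[str]:
    """Return the top_l candidates by score; stats maps word -> (LDF, GDF)."""
    scored = {w: ldf_igdf(ldf, gdf) for w, (ldf, gdf) in stats.items()}
    return sorted(scored, key=lambda w: scored[w], reverse=True)[:top_l]


# Toy (LDF, GDF) pairs, made up for illustration only.
stats = {
    "ヒスタミン": (300, 120_000),
    "花粉": (400, 1_500_000),
    "情報": (500, 1_900_000),
}
print(narrow_candidates(stats, top_l=2))
```

The design point the sketch illustrates is that a word with a modest local frequency can still outrank a locally frequent word if it is rare across the whole Web text group.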

The relevance calculation unit (PMI calculation unit) 144 then calculates the degree of relevance between the category name C and each of the top 50 feature word candidates w selected by the narrowing processing unit 143.
The relevance calculation unit 144 calculates, as the degree of relevance, the degree to which the category name C and the feature word candidate w appear together (their degree of co-occurrence) in the Web text group searched by the search engine 141. Candidates with a high degree of relevance are adopted as the feature words for category C.

Specifically, for each category name C, the relevance calculation unit 144 calculates the pointwise mutual information (PMI) between the category name C and the feature word candidate w.
PMI is calculated according to equation (2).

Here, P(X) is the probability that the word X occurs, and Hc(X) is the hit count (number of hits) obtained when the search engine 141 searches with the word X as the query (search keyword). Hc(X1, X2) is the hit count of an AND search for the words X1 and X2.
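The image of equation (2) is not reproduced in this text. The standard definition of pointwise mutual information, combined with the hit-count notation defined here, gives the following form. Estimating each probability as a hit count divided by the total number of Web texts N is an assumption consistent with those definitions, and the base of the logarithm is not stated in the text.

```latex
\mathrm{PMI}(C, w)
  \;=\; \log\frac{P(C, w)}{P(C)\,P(w)}
  \;\approx\; \log\frac{H_c(C, w)/N}{\bigl(H_c(C)/N\bigr)\bigl(H_c(w)/N\bigr)}
  \;=\; \log\frac{N \cdot H_c(C, w)}{H_c(C)\,H_c(w)}
```

The three hit counts on the right-hand side correspond to the AND search for both terms, the search for the category name alone, and the search for the feature word candidate alone.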

The degree of relevance (PMI) calculated by the relevance calculation unit 144 is compared with the threshold th (= 4) in the threshold comparison unit 145. A feature word candidate w whose degree of relevance to the category name C is greater than the threshold th = 4 becomes a feature word for category C.
Feature words are stored in the feature word database 14a in association with their category, together with their relevance (PMI) values.
FIG. 5 shows an example of the feature word database 14a. For example, the category 「鼻炎用剤」 (rhinitis medication) has 23 registered feature words, including 「ヒスタミン」 (histamine), 「鎮痛」 (analgesic), 「気管支」 (bronchi), and 「花粉」 (pollen). The relevance (PMI) values of these four words are 5.687, 5.410, 5.075, and 4.010, respectively.

Because the same PMI threshold th is used for every category, the number of feature words differs from category to category. A single word may also be a feature word of more than one category.
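The PMI calculation and threshold comparison can be sketched as follows. This is a hedged illustration: estimating probabilities as hit count divided by N and using the natural logarithm are assumptions (the text does not state the log base), the function names are hypothetical, and the entry for 「薬」 is a made-up value added only to show a rejected candidate. The four PMI values above the threshold are those quoted for FIG. 5.

```python
import math
from typing import Dict

N_DOCS = 100_000_000   # N in this embodiment
PMI_THRESHOLD = 4.0    # threshold th in this embodiment


def pmi(hc_joint: int, hc_c: int, hc_w: int, n_docs: int = N_DOCS) -> float:
    """PMI from hit counts, assuming P(X) is estimated as Hc(X) / N."""
    if hc_joint == 0:
        return float("-inf")  # the two terms never co-occur
    return math.log(n_docs * hc_joint / (hc_c * hc_w))


def select_feature_words(
    pmi_by_word: Dict[str, float], th: float = PMI_THRESHOLD
) -> Dict[str, float]:
    """Keep candidates whose PMI is strictly greater than the threshold."""
    return {w: s for w, s in pmi_by_word.items() if s > th}


# PMI values quoted for FIG. 5, plus one made-up low value ("薬").
candidates = {"ヒスタミン": 5.687, "鎮痛": 5.410, "気管支": 5.075,
              "花粉": 4.010, "薬": 3.2}
feature_words = select_feature_words(candidates)
print(sorted(feature_words))
```

Because the comparison is strict and the threshold is shared by all categories, the number of surviving feature words naturally varies by category.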

By executing the processing described above for each product category, a feature word database 14a storing the feature words corresponding to each category can be generated.
The generation process may also be executed periodically so that the feature word database 14a is kept up to date.

[1.3 Mapping server (text classification function)]
FIG. 6 shows the functional blocks of the mapping server 12. The mapping server 12 maps individual blog articles (blog texts) to product categories.

The mapping server 12 includes a keyword extraction unit 121 that extracts mapping keywords from a blog text (mobile blog text), and a mapping processing unit 122 that generates mapping information between the blog text and product categories based on the extracted keywords.

The keyword extraction unit 121 includes a morphological analyzer (JUMAN) 121a that performs morphological analysis of the blog text, a detection unit 121b that detects possible errors in that morphological analysis, and an exclusion unit 121c that excludes from the keywords any "suspicious" morphemes that may be the result of an analysis error.

As in the learning of feature words, the morphological analyzer 121a extracts morphemes (words) whose part of speech is either noun (excluding those sub-classified as temporal nouns, single hiragana characters, and single katakana characters) or undefined word.
The extracted morphemes (words) basically become the keywords used for score calculation in the mapping processing unit 122. However, the texts to be mapped (classified) in this embodiment are blog texts (mobile blog texts), a form of CGM, and therefore contain many colloquial expressions. Morphological analysis of such colloquial text produces more analysis errors than analysis of newspaper text and the like.

Morphological analysis errors are especially noticeable in hiragana text, as in the following examples.
(Colloquial text example 1) 言われてたんだなぁ
In example 1, morphological analysis may extract 「だな」 as a keyword. Because the keyword 「だな」 is a feature word of the product category 「たな一般」 (shelves, general), a score would wrongly be given to that category.

(Colloquial text example 2) 絵をぺそりと
In example 2 as well, morphological analysis may extract 「そり」 (sled) as a keyword.

The detection unit 121b therefore detects "suspicious" hiragana words (morphemes), which are a typical pattern of morphological analysis error. A morpheme is "suspicious" (that is, possibly the result of an analysis error) in the following cases.
Detection rule 1 for possible analysis errors: either the preceding or the following morpheme is an undefined word (a word whose part of speech is unknown) consisting of a single hiragana or a single katakana character.
Detection rule 2 for possible analysis errors: either the preceding or the following morpheme is a small hiragana character (っ, ゃ, ゅ, ょ, ぁ, ぃ, ぅ, ぇ, ぉ) or a small katakana character (ッ, ャ, ュ, ョ, ァ, ィ, ゥ, ェ, ォ).

According to rule 1, 「そり」 in example 2 can be detected as a suspicious morpheme, because the preceding 「ぺ」 is analyzed as an undefined word. The exclusion unit 121c therefore excludes 「そり」 from the keywords.
According to rule 2, 「だな」 in example 1 can be detected as a suspicious morpheme, because the following 「ぁ」 is a small hiragana character. The exclusion unit 121c therefore excludes 「だな」 from the keywords.
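The two detection rules can be sketched as a check on the neighbours of a candidate morpheme. This is an illustrative sketch only: the `Morpheme` tuple, the part-of-speech labels, and the segmentations of the two example sentences are hypothetical and do not reproduce the actual output format of JUMAN.

```python
from typing import List, NamedTuple

# Small kana listed in detection rule 2.
SMALL_KANA = set("っゃゅょぁぃぅぇぉッャュョァィゥェォ")


class Morpheme(NamedTuple):
    surface: str
    pos: str  # hypothetical labels, e.g. "noun", "particle", "undefined"


def is_single_kana(s: str) -> bool:
    """True for exactly one hiragana or one katakana character."""
    return len(s) == 1 and ("\u3041" <= s <= "\u3096" or "\u30a1" <= s <= "\u30fa")


def is_suspicious(morphemes: List[Morpheme], i: int) -> bool:
    """Apply rules 1 and 2 to the morphemes adjacent to position i."""
    neighbours = []
    if i > 0:
        neighbours.append(morphemes[i - 1])
    if i + 1 < len(morphemes):
        neighbours.append(morphemes[i + 1])
    for m in neighbours:
        # Rule 1: neighbour is an undefined word of a single kana character.
        if m.pos == "undefined" and is_single_kana(m.surface):
            return True
        # Rule 2: neighbour is a small hiragana or small katakana character.
        if m.surface in SMALL_KANA:
            return True
    return False


# Hypothetical segmentations of the two colloquial examples.
example1 = [Morpheme("言われてたん", "other"), Morpheme("だな", "noun"),
            Morpheme("ぁ", "undefined")]
example2 = [Morpheme("絵", "noun"), Morpheme("を", "particle"),
            Morpheme("ぺ", "undefined"), Morpheme("そり", "noun"),
            Morpheme("と", "particle")]
print(is_suspicious(example1, 1), is_suspicious(example2, 3))
```

A morpheme flagged by `is_suspicious` would then be dropped from the keyword list, which is the role the text assigns to the exclusion unit 121c.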

Among the words extracted as described above, those that are feature words of any category C in the feature word database 14a are taken as mapping keywords and passed to the mapping processing unit 122.

The mapping processing unit 122 includes a score calculation unit 122a and a mapping unit 122b. It calculates scores from the mapping keywords and, based on those scores, generates the mapping information used to map the blog article to product categories.

The score calculation unit 122a calculates Score(C) for each category C according to equation (3).

Here, PMI(C, w) is the mutual information (degree of relevance) between the category C and the keyword (feature word) w, obtained from the feature word database. tf(w) is the frequency of the keyword w in the blog text.
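The image of equation (3) is not reproduced in this text. One form consistent with the definitions of PMI(C, w) and tf(w) is a frequency-weighted sum over the mapping keywords that are feature words of the category. This form is an assumption, and the symbols K (the set of mapping keywords found in the blog text) and F(C) (the set of feature words registered for category C) are introduced here only for illustration.

```latex
\mathrm{Score}(C) \;=\; \sum_{w \,\in\, K \cap F(C)} \mathrm{PMI}(C, w)\cdot \mathrm{tf}(w)
```

The original formula may weight the term frequency differently (for example with a logarithm); the text available here does not settle that point.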

According to equation (3), a category receives a higher score the more of the keywords contained in the blog text are registered as its feature words in the feature word database 14a. The higher a category's score, the more relevant it can be considered to the blog text.

The mapping unit 122b sorts the categories in descending order of score and selects the one or more top-scoring categories (three categories in this embodiment) as the categories corresponding to the blog. The selected categories are output together with their scores as mapping information and passed to the advertisement server 13.
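The scoring and top-category selection can be sketched as follows. The sketch assumes the score is a sum of PMI(C, w) · tf(w) over matching keywords, since the equation itself is not reproduced in this text. The database layout, the function names, the per-word PMI values, and the category named "hypothetical category" are all illustrative assumptions; only the choice of three categories comes from the description.

```python
from collections import Counter
from typing import Dict, List, Tuple

# category -> {feature word: PMI}; an illustrative layout, not the real schema.
FeatureDB = Dict[str, Dict[str, float]]


def score_categories(keywords: List[str], db: FeatureDB) -> Dict[str, float]:
    """Score each category under the ASSUMED form sum(PMI(C, w) * tf(w))."""
    tf = Counter(keywords)
    scores: Dict[str, float] = {}
    for category, words in db.items():
        s = sum(pmi * tf[w] for w, pmi in words.items() if w in tf)
        if s > 0:
            scores[category] = s
    return scores


def top_categories(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Sort categories by descending score and keep the top k."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy database; the PMI values are assumptions chosen for illustration.
db: FeatureDB = {
    "カラオケ・歌集・歌謡曲楽譜": {"カラオケ": 4.726},
    "トローチ剤": {"ノド": 4.554},
    "うがい薬": {"ノド": 4.019},
    "hypothetical category": {"カラオケ": 4.001},
}
mapping = top_categories(score_categories(["ノド", "カラオケ"], db))
for category, score in mapping:
    print(category, score)
```

Keeping the score alongside each selected category mirrors the text, where the mapping information passed to the advertisement server 13 contains both.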

FIG. 7 shows an example of mapping a blog text to product categories. Most of the text other than the mapping keywords is omitted from the figure. In the blog text shown, the underlined words 「ノド」 (throat) and 「カラオケ」 (karaoke) were extracted as mapping keywords. The product categories selected for this blog text are 「カラオケ・歌集・歌謡曲楽譜」 (karaoke, songbooks, and popular-song scores; score 4.726), 「トローチ剤」 (troches; score 4.554), and 「うがい薬」 (gargle; score 4.019).

On receiving the mapping information, the advertisement server 13 refers to the advertisement database 13a and selects one or more pieces of advertisement data to display with the blog. The advertisement database 13a stores advertisement data classified by product category, so the advertisement server 13 can determine the advertisement data to display with the blog by selecting advertisement data belonging to the product categories indicated by the mapping information.

The advertisement data selected by the advertisement server 13 is delivered to the user terminal 2 together with the blog text and other content, and is displayed on the blog.

[1.4 Experimental example]
Feature words for the JICFS categories were learned automatically by the feature word learning server 14 described above. Of the JICFS categories, the 1771 categories that do not contain the string 「その他」 (other) were used. Categories containing 「その他」 were excluded because the category name with 「その他」 removed usually exists as a separate category. For example, there are categories named 「電子ゲーム」 (electronic games) and 「電子ゲームその他」 (electronic games, other), so the latter was excluded.
Of the 1771 categories, one or more feature words were learned for 1464 categories, and the average number of feature words per category was 9.5.

Next, the results of automatic blog mapping by the mapping server 12 were evaluated. The blogs were taken from aimew (http://aimew.jp/), a mobile blog site operated by Rockwave. Up to three categories were manually assigned to each of 50 blog articles, and these were compared with the top three categories selected by the mapping server 12 for the same 50 articles. The evaluation used precision, recall, and F-measure. The results were a precision of 81.0% (64/79), a recall of 80.0% (64/80), and an F-measure of 0.805, which is a relatively good result.
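The reported figures can be checked with a few lines of arithmetic. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall (F1), which the text does not state explicitly but which reproduces the reported 0.805. The function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f(correct: int, predicted: int, gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-measure from raw counts."""
    precision = correct / predicted   # correct system categories / all system categories
    recall = correct / gold           # correct system categories / all manual categories
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts from the text: 64 correct, 79 assigned by the system, 80 assigned by hand.
p, r, f = precision_recall_f(64, 79, 80)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "0.810 0.800 0.805"
```

Note that the system assigned 79 categories rather than 150, so it did not always output three categories per article.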

[2. Search-linked advertisement distribution computer system]
FIGS. 8 and 9 show a search-linked advertisement distribution computer system 101 according to the second embodiment (hereinafter simply "the system 101").
The system 101 distributes advertisements (advertisement data) related to the search keywords entered into a search engine via the Internet, and displays them on the screen of the user terminal 2 together with Web content such as the search results.

The system 101 includes a search engine 111 that provides users with an Internet search service, a mapping server 112 that analyzes the search keywords entered into the search engine 111 and maps the search result screen to product categories, an advertisement server 13 that manages the advertisement data to be displayed with the search results, a feature word learning server 14 that automatically learns the feature words corresponding to each product category, and so on.
Aspects of the system 101 of the second embodiment that are not specifically described are the same as in the first embodiment. Functions that are the same as in the first embodiment carry the same reference numerals in the drawings.

The system 101 of the second embodiment differs from the system 1 of the first embodiment in that the blog server 11 of the first embodiment is replaced by the search engine 111, and in that the mapping server 112 has the configuration shown in FIG. 9.

The mapping server 112 of the second embodiment receives the search keywords from the search engine 111 as its mapping keywords, so it does not need the keyword extraction unit 121.
The mapping server 112 therefore consists only of the mapping processing unit 122.

The mapping processing unit 122 takes the one or more search keywords it is given as mapping keywords and generates mapping information by the same processing as in the first embodiment.
Based on the mapping information, the advertisement server 13 refers to the advertisement database 13a and selects one or more pieces of advertisement data to display with the search results.

The advertisement data selected by the advertisement server 13 is delivered to the user terminal 2 and displayed together with the search results.

[3. Application to other systems]
The systems 1 and 101 of the first and second embodiments are advertisement distribution systems, but the functions of the mapping servers 12 and 112 and of the feature word learning server 14 can also be used in other systems that need to classify text, such as Web text, into predetermined categories. In other words, the systems 1 and 101 of the first and second embodiments can also be regarded as text classification computer systems.

The present invention is not limited to the above embodiments, and various modifications are possible.
For example, the categories are not limited to product categories and may be categories of any kind.

In the above embodiments, the query (search keyword) used when the search engine 141 collects up to 1000 Web texts is the category name itself, but the query need not be the category name and may be any term suited to the search. For example, when it is hard to picture specific products from a category name, as with the category 「スキー防具」 (ski protective gear), using the category name itself as the query may collect unsuitable Web texts. To handle such cases, another term, such as the name of a specific product in the category, may be used as the query (main keyword). An AND search combining the category name with a specific product name or the like may also be used.

Furthermore, the rules for detecting possible morphological analysis errors are not limited to those given as examples, and other rules may be added. One such rule would treat 「やけど」 (which can be read as "burn") as a suspicious morpheme in a text such as 「雷は見てるんは綺麗やけど音は嫌い。」 ("Thunder is pretty to watch, but I hate the sound."), where it is actually a dialectal conjunction meaning "but".
In addition, if 「ダース」 (dozen), which appears inside 「スパイダース」, is a feature word of the category 「ラクロスボール」 (lacrosse balls), it is better not to extract 「ダース」 as a mapping keyword. To this end, named entity analysis can be applied to the blog text so that morphemes inside a named entity are not treated as mapping keywords.

It is also preferable to disambiguate polysemous words in blog articles. For example, 「ブランコ」 (swing), as in 「・・・ブランコに乗る」 ("... ride on the swing"), is a feature word not only of the category 「遊具」 (playground equipment) but also of the category 「釣用履物」 (fishing footwear), because fishing has a rig called a 「ブランコ仕掛け」 (swing rig). This problem can be addressed by resolving the ambiguity of polysemous words.

FIG. 1 is a block diagram of the content-linked advertisement distribution computer system.
FIG. 2 is a block diagram of the feature word learning server.
FIG. 3 is a conceptual diagram of the processing of the feature word learning server.
FIG. 4 is a table showing hit counts of words.
FIG. 5 is a block diagram of the feature word database.
FIG. 6 is a block diagram of the mapping server.
FIG. 7 is a diagram showing a specific example of the result of the mapping process.
FIG. 8 is a block diagram of the search-linked advertisement distribution computer system.
FIG. 9 is a block diagram of the mapping server.

Explanation of symbols

1 Content-linked advertisement distribution computer system (text classification system)
2 User terminal
3 Web server
11 Blog server
11a Blog database
12 Mapping server
13 Advertisement server
13a Advertisement database
14 Feature word learning server (automatic feature word learning system)
14a Feature word database (automatic feature word learning system)
141 Search engine
142 Feature word candidate extraction unit
143 Narrowing processing unit
144 Relevance calculation unit
145 Threshold comparison unit
121 Keyword extraction unit
121a Morphological analyzer
121b Morphological analysis error detection unit
121c Analysis error morpheme exclusion unit
122 Mapping processing unit
122a Score calculation unit
122b Mapping unit
101 Search-linked advertisement distribution computer system (text classification system)
111 Search engine
112 Mapping server
C Category name
w Feature word (candidate)

Claims

An automatic feature word learning system that automatically learns feature words corresponding to each of predetermined categories, comprising:
Web text acquisition means for acquiring a plurality of Web texts from a search engine using a main keyword indicating a category as a query;
feature word candidate extraction means for extracting feature word candidates from the plurality of Web texts obtained using the main keyword as the query;
relevance calculation means for calculating a degree of relevance between the extracted feature word candidates and the category; and
a feature word database that stores, for each category, feature word candidates whose degree of relevance is higher than a predetermined threshold as feature words corresponding to the category,
wherein the relevance calculation means calculates, as the degree of relevance, a degree of co-occurrence indicating the degree to which the main keyword and the feature word candidate appear together in the Web text group searched by the search engine, based on a hit count obtained when the search engine searches with the main keyword and the feature word candidate as a query.

The automatic feature word learning system according to claim 1, wherein the relevance calculation means calculates the pointwise mutual information between the feature word candidate and the main keyword based on:
a first hit count obtained when the search engine searches with the main keyword and the feature word candidate as a query;
a second hit count obtained when the search engine searches with the main keyword as a query; and
a third hit count obtained when the search engine searches with the feature word candidate as a query.

The automatic feature word learning system according to claim 1 or 2, further comprising narrowing processing means for narrowing down the plurality of feature word candidates extracted for one category to reduce the number of extracted feature word candidates,
wherein the relevance calculation means calculates the degree of relevance between the main keyword and the feature word candidates narrowed down by the narrowing processing means.

The automatic feature word learning system according to claim 3, wherein the narrowing processing means narrows down the plurality of feature word candidates extracted for one category based on the frequency with which each feature word candidate appears in the plurality of Web texts acquired by the Web text acquisition means.

The automatic feature word learning system according to claim 3, wherein the narrowing processing means narrows down the plurality of feature word candidates extracted for one category based on:
a first frequency with which the feature word candidate appears in the plurality of Web texts acquired by the Web text acquisition means; and
a second frequency with which the feature word candidate appears in the Web text group searched by the search engine.

The automatic feature word learning system according to claim 5, wherein the narrowing processing means is configured to calculate a narrowing score for each extracted feature word candidate using a narrowing calculation formula and to take, as the narrowed-down feature word candidates, those whose narrowing score is larger than a predetermined narrowing threshold, and
the narrowing calculation formula is constructed so that the narrowing score becomes higher as the first frequency becomes higher and becomes lower as the second frequency becomes higher.

The automatic feature word learning system according to any one of claims 1 to 6, wherein the feature word candidate extraction means comprises means for excluding, from the feature word candidates, high-frequency words whose frequency of appearance in the Web text group searched by the search engine is higher than a predetermined frequency threshold.

A content-linked advertisement distribution computer system that distributes advertisements related to the contents of Web content, comprising:
a feature word database storing feature words corresponding to each category;
an advertisement database storing advertisement data for each category;
keyword extraction means for analyzing the Web content targeted for advertisement distribution and extracting keywords;
selection means for referring to the feature word database based on the extracted keywords and selecting one or more categories corresponding to the Web content; and
means for referring to the advertisement database based on the selected categories and displaying the advertisement data of the selected categories as advertisements together with the Web content targeted for advertisement distribution,
wherein the feature word database obtained by the automatic feature word learning system according to any one of claims 1 to 7 is used as the feature word database.

The content-linked advertisement distribution computer system according to claim 8, wherein the keyword extraction means performs morphological analysis on text included in the Web content targeted for advertisement distribution to extract morphemes serving as the keywords, and comprises:
detection means for detecting the possibility of a morphological analysis error caused by colloquial text being included in the Web content; and
means for excluding, from the keywords, morphemes for which the possibility of a morphological analysis error has been detected.

The content-linked advertisement distribution computer system according to claim 9, wherein the detection means detects the possibility of a morphological analysis error based on the morpheme before or after the morpheme serving as the keyword.

The content-linked advertisement distribution computer system according to claim 10, wherein the detection means detects the possibility of a morphological analysis error for a morpheme serving as the keyword when the morpheme before or after it is a single hiragana character or a single katakana character whose part of speech has been determined to be unknown.

The content-linked advertisement distribution computer system according to claim 10 or 11, wherein the detection means detects the possibility of a morphological analysis error for a morpheme serving as the keyword when the morpheme before or after it is a single small hiragana character or a single small katakana character.

A search-linked advertisement distribution computer system that distributes advertisements related to search keywords, comprising:
a feature word database storing feature words corresponding to each category;
an advertisement database storing advertisement data for each category;
selection means for referring to the feature word database based on a search keyword and selecting one or more product categories corresponding to the search keyword; and
means for referring to the advertisement database based on the selected categories and displaying the advertisement data of the selected categories on a website,
wherein the feature word database obtained by the automatic feature word learning system according to any one of claims 1 to 7 is used as the feature word database.

A text classification computer system that classifies text data, comprising:
a feature word database storing feature words corresponding to each predetermined category;
keyword extraction means for analyzing the text data to be classified and extracting keywords; and
selection means for referring to the feature word database based on the extracted keywords and selecting a category corresponding to the text data,
wherein the feature word database is a feature word database obtained by the feature word automatic learning system according to any one of claims 1 to 7.
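A minimal sketch of this classification flow follows. The claim does not say how a category is chosen from the matched feature words, so the rule used here (pick the category with the most matching keywords) is an assumption, as are the whitespace tokenizer standing in for real keyword extraction and the function names.

```python
from typing import Dict, List, Optional, Set


def extract_keywords(text: str) -> List[str]:
    """Stand-in for keyword extraction; real Japanese text would need morphological analysis."""
    return text.lower().split()


def classify_text(text: str, feature_db: Dict[str, Set[str]]) -> Optional[str]:
    """Return the category with the most matching feature words, or None if nothing matches."""
    keywords = extract_keywords(text)
    best_category: Optional[str] = None
    best_score = 0
    for category, feature_words in feature_db.items():
        score = sum(1 for keyword in keywords if keyword in feature_words)
        if score > best_score:  # ties keep the category seen first
            best_category, best_score = category, score
    return best_category
```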

A computer program for causing a computer to function as the feature word automatic learning system according to any one of claims 1 to 7.

A computer program for causing a computer to function as the content-linked advertisement distribution computer system according to any one of claims 8 to 12.

A computer program for causing a computer to function as the search-linked advertisement distribution computer system according to claim 13.

A computer program for causing a computer to function as the text classification computer system according to claim 14.

A method of automatically generating, by a computer, a feature word database in which feature words corresponding to respective predetermined categories are stored, the method including the steps of:
the computer acquiring a plurality of Web texts from a search engine using a main keyword indicating a category as a query;
the computer extracting feature word candidates from the plurality of Web texts obtained using the main keyword as the query;
the computer calculating a degree of association between the main keyword and each feature word candidate; and
the computer storing in the feature word database, for each category, feature word candidates whose degree of association is higher than a predetermined threshold as feature words corresponding to that category,
wherein, in the step of calculating the degree of association between the main keyword and the feature word candidate, a co-occurrence degree is calculated that indicates the degree to which the main keyword and the feature word candidate appear together in the group of Web texts searched by the search engine, and the degree of association is calculated based on the hit count obtained when the search engine is queried with the main keyword and the feature word candidate.
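The scoring and thresholding steps of this method can be sketched as below. The claim states only that the degree of association is based on search-engine hit counts and does not give a formula; the Dice coefficient used here is a swapped-in assumption chosen for illustration. The `hit_count` callable is a hypothetical stand-in for a search engine API, and candidate extraction from the retrieved Web texts is assumed to have happened already.

```python
from typing import Callable, Dict, Iterable, Set


def cooccurrence(main_keyword: str, candidate: str, hit_count: Callable[[str], int]) -> float:
    """Dice coefficient over hit counts (an assumed formula, not taken from the claim)."""
    hits_both = hit_count(f"{main_keyword} {candidate}")  # query containing both terms
    hits_main = hit_count(main_keyword)
    hits_candidate = hit_count(candidate)
    if hits_main + hits_candidate == 0:
        return 0.0
    return 2.0 * hits_both / (hits_main + hits_candidate)


def build_feature_db(
    candidates_by_category: Dict[str, Iterable[str]],
    hit_count: Callable[[str], int],
    threshold: float,
) -> Dict[str, Set[str]]:
    """Keep, for each category, the candidates whose association exceeds the threshold."""
    feature_db: Dict[str, Set[str]] = {}
    for main_keyword, candidates in candidates_by_category.items():
        feature_db[main_keyword] = {
            candidate
            for candidate in candidates
            if candidate != main_keyword
            and cooccurrence(main_keyword, candidate, hit_count) > threshold
        }
    return feature_db
```

Passing the search engine in as a callable keeps the sketch testable with canned hit counts; a real deployment would also need to handle rate limits and the approximate nature of reported hit counts.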

A method for distributing, by a computer, advertisements related to the contents of Web content, the method comprising the steps of:
the computer analyzing Web content targeted for advertisement distribution and extracting keywords;
the computer referring, based on the extracted keywords, to the feature word database obtained by the generation method according to claim 19, and selecting one or more categories corresponding to the Web content; and
the computer referring, based on the selected category, to an advertisement database storing advertisement data corresponding to each category, and displaying the advertisement data of the selected category together with the Web content targeted for advertisement distribution.

A method of distributing, by a computer, advertisements related to search keywords, the method comprising the steps of:
the computer referring, based on a search keyword, to the feature word database obtained by the generation method according to claim 19, and selecting one or a plurality of categories corresponding to the search keyword; and
the computer referring, based on the selected category, to an advertisement database storing advertisement data corresponding to each category, and displaying the advertisement data of the selected category on a website.

A method for classifying text data by a computer, the method comprising the steps of:
analyzing text data to be classified and extracting keywords; and
referring, based on the extracted keywords, to the feature word database obtained by the generation method according to claim 19, and selecting a category corresponding to the text data.