JP4554493B2

JP4554493B2 - Document issuer classification apparatus and program

Info

Publication number: JP4554493B2
Application number: JP2005326107A
Authority: JP
Inventors: 吉秀佐藤; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-11-10
Filing date: 2005-11-10
Publication date: 2010-09-29
Anticipated expiration: 2025-11-10
Also published as: JP2007133659A

Description

The present invention relates to a document issuer classification apparatus and program for classifying the issuers of documents, and in particular to a document issuer classification apparatus and program for determining which of two or more types an unknown document issuer belongs to.

At present, a wide variety of documents are published on the Internet, and the volume of e-mail traveling across the Internet has become enormous.

Recently, the weblog, a simple HTML authoring system, has become widespread, making it easier than ever for individuals to publish information. In this situation, advertising information issued in bulk to an unspecified large number of users accounts for a growing share of everything in circulation, and users must spend more effort sorting useful information from the rest.

For example, for e-mail there is a technique that identifies and discards messages the user does not need, mainly advertisements, from among the messages the user receives. This technique extracts the words in each message to generate a vector representing the message's features, and calculates the similarity to a vector representing the features of known advertising mail to determine whether the message is an advertisement (see, for example, Patent Document 1).

In addition, some Internet search engines incorporate a technique that, like the existing technique above, uses the words appearing in a page to automatically determine whether the page is harmful, so that pages harmful to children are not shown in search results (see, for example, Non-Patent Document 1).
Patent Document 1: JP 2004-048492 A
Non-Patent Document 1: http://partners.search.goo.ne.jp/howto/function/ "Block harmful content"

Starting with the existing techniques above, a number of techniques have been devised that take a document collection in which documents containing useful information are mixed with documents the user does not want, such as advertisements, and determine whether each document is an advertisement.

However, consider the case where these existing techniques are given, as representative advertising documents, documents containing keywords common in product advertisements, such as "big bargain" and "limited time only", or specific product names, and then use them as the basis for judging whether an unknown document is an advertisement. A non-advertising document that happens to contain these keywords is then likely to be misjudged as an advertisement. For example, a useful document in which a customer who bought a product describes their impressions of using it may be judged to be an advertising document. This problem is unavoidable for techniques that, like the existing ones, judge only by the words that appear, without considering context.

The present invention has been made in view of the above points. Its object is to provide a document issuer classification apparatus and program that can determine whether a new document is one the user does not want, such as an advertisement, even when the distinction is difficult to make from the words in the document alone.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention is a document issuer classification method for classifying document issuers into two or more types, and performs:
a similarity calculation step (step 1) of acquiring document groups written by the same issuer from a learning document storage means, which stores document groups for learning the document issuer classification rules together with the classification information of each issuer, calculating one or more kinds of similarity between two different documents (between titles, between body texts, or between the link information in the body texts), calculating for each issuer the average value of each kind of similarity (the average similarity), and using it as the issuer's feature quantity;
a classification rule learning step (step 2) of learning, from each issuer's feature quantity and the issuer's known classification information acquired from the learning document storage means, a classification rule for classifying issuers whose classification is unknown, and storing the learning result in a classification rule storage means; and
a classification step (step 4) of acquiring document groups written by the same issuer from a classification target document storage means, which stores document groups written by issuers whose classification is unknown, calculating the feature quantity by the same procedure as the similarity calculation step (step 3), determining the issuer's classification based on the classification rule recorded in the classification rule storage means, and storing the classification result in a classification result storage means.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is a document issuer classification apparatus for classifying document issuers into two or more types, comprising:
learning document storage means 101 that stores document groups for learning the document issuer classification rules together with the classification information of each document issuer;
classification target document storage means 102 that stores document groups written by issuers whose classification is unknown;
similarity calculation means 104 that acquires document groups written by the same issuer from the learning document storage means 101, calculates one or more kinds of similarity between two different documents (between titles, between body texts, or between the link information in the body texts), calculates for each issuer the average value of each kind of similarity (the average similarity), and uses it as the issuer's feature quantity;
classification rule learning means 105 that learns, from each issuer's feature quantity and the issuer's known classification information acquired from the learning document storage means 101, a classification rule for classifying issuers whose classification is unknown, and stores the learning result in classification rule storage means 106; and
classification means 107 that acquires document groups written by the same issuer from the classification target document storage means 102, calculates the feature quantity by the same procedure as the similarity calculation means 104, determines the issuer's classification based on the classification rule recorded in the classification rule storage means 106, and stores the classification result in classification result storage means 108.

The present invention (claim 2) further comprises keyword counting means that acquires keywords of a specific type from each document acquired from the learning document storage means 101 and the classification target document storage means 102, and calculates for each issuer the average number of occurrences of those keywords per document,
wherein the classification rule learning means 105 and the classification means 107 include means for adding the average number of occurrences as a further feature quantity when learning the classification rules and classifying document issuers.

In the present invention (claim 3), the keyword counting means includes means for extracting monetary expressions from the documents acquired from the learning document storage means 101 and the classification target document storage means 102, and totaling the average number of occurrences of monetary expressions per document.
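The keyword count in claim 3 could be sketched as follows. This is only an illustration: the patent does not say how monetary expressions are recognized, so the regular expression `MONEY_PATTERN` and the function name `average_money_mentions` are assumptions made for this example.

```python
import re

# Hypothetical pattern (not from the patent): yen amounts written as
# "1,980円", "¥500", or "3000 yen".
MONEY_PATTERN = re.compile(r"[¥￥]\s?\d[\d,]*|\d[\d,]*\s?(?:円|yen)")


def average_money_mentions(documents):
    """Return the mean number of monetary expressions per document.

    `documents` is a list of body-text strings written by one issuer.
    """
    if not documents:
        return 0.0
    total = sum(len(MONEY_PATTERN.findall(text)) for text in documents)
    return total / len(documents)
```

An issuer whose documents routinely quote prices would get a high value for this feature, while a diary-style issuer would typically score close to zero.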

The present invention (claim 4) is a document issuer classification program that causes a computer having learning document storage means storing document groups that serve as learning material for classifying document issuers, discrimination target document storage means storing document groups written by issuers whose classification is unknown, classification rule storage means, and classification result storage means to function as the document issuer classification apparatus according to any one of claims 1 to 3.

As described above, the present invention uses as feature quantities for classifying issuers not word frequencies and the like, but the similarity between documents written by the same issuer and the frequency of keywords of a specific type, such as monetary expressions. Because the classification is based not on the meaning of the documents but on characteristics such as the issuer's writing habits, it resolves the problem of an issuer who happens to use particular words such as product names frequently being misclassified as an advertiser.
In addition, by calculating the feature quantities after sorting the documents of the same issuer, the difference in feature quantities between an advertising issuer, who merely picks one of several boilerplate texts at random, and an issuer who writes non-advertising documents whose content differs every day, such as a diary, becomes clear, so the two can be classified correctly.

Embodiments of the present invention are described below with reference to the drawings.

The present invention records classification rules learned from learning documents whose issuer classification is known, and uses documents written by an issuer whose classification is unknown to determine that issuer's classification.

[First Embodiment]
FIG. 3 is a configuration diagram of the document issuer classification apparatus according to the first embodiment of the present invention.

The document issuer classification apparatus shown in the figure comprises a learning document recording unit 101, a classification target document recording unit 102, an order determination unit 103, a similarity calculation unit 104, a classification rule learning unit 105, a classification rule recording unit 106, a classification unit 107, and a classification result recording unit 108.

The learning document recording unit 101 is a recording medium such as a disk device, and records the document groups that serve as learning material for classifying document issuers.

The classification target document recording unit 102 is a recording medium such as a disk device, and records document groups written by issuers whose classification is unknown. A document recorded in the learning document recording unit 101 consists of an issuer ID, an issuer classification (advertisement/non-advertisement), a document ID, a title, a body, and so on; a document recorded in the classification target document recording unit 102 consists of an issuer ID, a document ID, a title, a body, and so on.
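One way to picture the two record layouts is a single record type whose classification field is empty for classification target documents. The class and field names below are hypothetical and chosen only for this sketch; the patent specifies the attributes but not a concrete data structure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    issuer_id: str
    doc_id: str
    title: str
    body: str
    # "ad" or "non-ad" for learning documents; None when the class is unknown.
    label: Optional[str] = None


def group_by_issuer(documents):
    """Collect documents per issuer ID, preserving their input order."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc.issuer_id].append(doc)
    return dict(groups)
```

Grouping by issuer first matches the apparatus's premise that features are computed per issuer rather than per document.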

The order determination unit 103 sorts, according to a given condition, the document groups recorded in the learning document recording unit 101 when the classification rules are learned, and those recorded in the classification target document recording unit 102 when issuers are classified.

The similarity calculation unit 104 calculates the average similarity between adjacent documents in the learning document recording unit 101 or the classification target document recording unit 102, following the order determined by the order determination unit 103.

The classification rule learning unit 105 uses the similarities that the similarity calculation unit 104 calculated for the document groups recorded in the learning document recording unit 101 to learn classification rules for classifying issuers whose classification is unknown.

The classification rule recording unit 106 is a recording medium such as a disk device, and records the classification rules obtained as a result of learning by the classification rule learning unit 105.

The classification unit 107 classifies an issuer using the per-issuer average similarity that the similarity calculation unit 104 calculated for the document groups recorded in the classification target document recording unit 102, together with the classification rules recorded in the classification rule recording unit 106.

The classification result recording unit 108 records the classification results of the classification unit 107.

FIG. 4 is a flowchart showing an overview of the operation in the first embodiment of the present invention.

The similarity calculation unit 104 reads learning documents whose issuer classification is known from the learning document recording unit 101, compares adjacent documents of the same issuer, and calculates one or more of the similarity between titles, between bodies, and between the link information in the bodies. It calculates these similarities between adjacent documents across all documents of the same issuer and averages each kind, yielding the average title similarity, the average body similarity, and the average link-information similarity, which become the issuer's feature quantities (step 101).

Next, the classification rule learning unit 105 learns classification rules using each issuer's classification acquired from the learning document recording unit 101 and the feature quantities obtained in step 101, and records the resulting classification rules in the classification rule recording unit 106 (step 102).

Meanwhile, documents written by an issuer whose classification is unknown, acquired from the classification target document recording unit 102, are analyzed to calculate feature quantities in the same way as in step 101 (step 103), and the issuer's most likely classification is determined by checking them against the classification rules in the classification rule recording unit 106 (step 104).

Before step 101, the order determination unit 103 may sort the documents written by the same issuer according to a fixed criterion, so that in step 101 the similarity calculation unit 104 calculates the similarity between documents that are adjacent in the sorted order. In that case, the document groups acquired from the classification target document recording unit 102 are likewise sorted by the order determination unit 103 according to the same criterion before step 103.

The operation is described in detail below, based on the configuration in FIG. 3.

In the document issuer classification apparatus of the present invention, the classification rule learning unit 105 records in advance, in the classification rule recording unit 106, classification rules learned from document groups written by issuers whose classification is known, and the classification unit 107 determines an issuer's classification using the document groups in the classification target document recording unit 102 written by issuers whose classification is unknown. The learning documents stored in the learning document recording unit 101 must therefore include multiple documents written by the same issuer, prepared for each issuer, with a classification assigned to each issuer in advance.

It is likewise assumed that multiple documents exist for each issuer whose classification is unknown.

FIG. 5 shows examples of learning documents in the learning document recording unit in one embodiment of the present invention. Part (A) of the figure shows an example of documents whose classification is "advertisement", and part (B) shows an example of documents whose classification is "non-advertisement".

Each document has the following attributes: an issuer ID identifying the issuer, a classification indicating whether the documents written by that issuer are advertisements intended to promote products or the like, a document ID, a title, and a body. For example, in the advertising example in part (A) of the figure, the documents written by issuer ID "ABCD" are advertisements; the document with document ID "0001" has the title "Today's featured product (September 1)" and a body that begins "The product we introduce today is...". The body contains link information (a URL) pointing to a document with detailed product information. The non-advertising example in part (B) likewise has each attribute, including the classification "non-advertisement". Because the apparatus decides between advertisement and non-advertisement per issuer ID, all documents of the same issuer must carry the same classification.

Next, learning of the classification rules is described.

FIG. 6 is a flowchart of the classification rule learning procedure in the first embodiment of the present invention.

First, the order determination unit 103 refers to the learning document recording unit 101, sorts the documents written by the same issuer, and stores the result in the learning document recording unit 101 (step 301). One possible criterion is to sort the documents lexicographically by title. Suppose, for example, that the issuer with ID "ABCD" in FIG. 5 publishes frequently, alternating between documents titled in the form "Today's featured product (month/day)" and documents titled "Yesterday's order ranking". Sorting these lexicographically by title separates the documents by title format: the documents titled "Yesterday's order ranking" are gathered in the first half and those titled in the form "Today's featured product (month/day)" in the second half. In other words, the documents are split into a first and second half by title pattern. Even for an issuer who uses three or more title patterns, this sort likewise yields an order in which titles of the same format are consecutive.
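The sort in step 301 can be illustrated with a minimal sketch. The dictionary layout and the function name `order_by_title` are assumptions for this example, and the English titles stand in for the Japanese ones, so which title pattern lands in the first half depends on the collation used and may differ from the patent's example.

```python
def order_by_title(documents):
    """Return one issuer's documents sorted lexicographically by title."""
    return sorted(documents, key=lambda doc: doc["title"])


# An issuer that alternates between two title patterns.
published = [
    {"doc_id": "0001", "title": "Today's featured product (Sep 1)"},
    {"doc_id": "0002", "title": "Yesterday's order ranking"},
    {"doc_id": "0003", "title": "Today's featured product (Sep 2)"},
    {"doc_id": "0004", "title": "Yesterday's order ranking"},
]
ordered = order_by_title(published)
```

After sorting, documents that share a title pattern sit next to each other, which is what makes the adjacent-document comparison in the next step informative.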

Step 301 is optional, however; the documents do not have to be sorted.

Once the documents written by one issuer have been sorted, the similarity calculation unit 104 calculates the average similarity between adjacent documents (step 302). It acquires the title and body of one document at a time from the learning document recording unit 101, in the order produced by the order determination unit 103, and compares them with the title and body of the document acquired immediately before to calculate the similarity between adjacent documents. After calculating the similarities across all documents of issuer "ABCD", it averages them to obtain the average similarity.
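Step 302 amounts to averaging a pairwise score over consecutive documents. The sketch below assumes the similarity measure is supplied as a function; the name `average_adjacent_similarity` and the choice to return 0.0 when an issuer has fewer than two documents are assumptions of this example (the patent presupposes multiple documents per issuer).

```python
def average_adjacent_similarity(items, similarity):
    """Average `similarity(a, b)` over consecutive pairs in `items`.

    `items` holds one field (titles, bodies, or link lists) of a single
    issuer's documents, already in the order chosen in step 301.
    """
    if len(items) < 2:
        return 0.0  # assumption: no pairs means no measurable similarity
    scores = [similarity(a, b) for a, b in zip(items, items[1:])]
    return sum(scores) / len(scores)
```

Calling it once each for titles, bodies, and link information yields the three feature quantities described in the text.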

The average similarities calculated here are the title similarity, the body similarity, and the link-information similarity.

For title similarity, it is advisable to use the similarity between vectors made up of the words in the titles. For example, the five words "today", "featured", "product", "September", and "1st" are extracted from the title "Today's featured product (September 1)" to form a word vector. The title word vectors built this way are compared between adjacent documents, and a value expressing how strongly the vectors are related, such as Euclidean distance or cosine similarity, is calculated to determine the similarity. With this method, comparing two identical titles "Yesterday's order ranking" gives the maximum similarity, and comparing two titles of the form "Today's featured product (month/day)" also gives a high similarity because part of the titles is shared. The average title similarity calculated for the documents of issuer ID "ABCD" is therefore high. In contrast, for an issuer who writes and publishes diary-style documents, like issuer ID "EFGH" in the non-advertising example of FIG. 5(B), the title changes with every document, so the average title similarity is lower than that of issuer "ABCD".
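As a rough illustration of the word-vector comparison, the sketch below computes cosine similarity over word counts. It assumes the titles have already been split into words (Japanese text would need a morphological analyzer, which is outside this example), and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bag-of-words vectors."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sep_1 = ["today", "featured", "product", "September", "1st"]
sep_2 = ["today", "featured", "product", "September", "2nd"]
diary = ["rainy", "afternoon", "walk"]
```

Here `sep_1` and `sep_2` share four of five words and score about 0.8, while `sep_1` and `diary` share none and score 0, mirroring the contrast between issuers "ABCD" and "EFGH".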

Body similarity, like title similarity, is best calculated from word vectors. Here too the average similarity increases because fixed expressions are common: for example, documents from issuer "ABCD" with titles of the form "Today's featured product (month/day)" contain expressions such as "order number" and "click here for details", and documents titled "Yesterday's order ranking" contain rank terms such as "1st" and "2nd".

The similarity of link information is obtained by analyzing the URLs contained in the body text of the examples in FIG. 5. If the link information in the advertising example of FIG. 5(A) leads to a document with detailed information about the product introduced in the body, then the link information in other documents from the same issuer is also likely to lead to product detail documents. A URL is a string that represents the location of a document on the Internet, and the URLs of similar documents, such as product pages, share many parts, so the URLs themselves are highly similar. Because a URL expresses the hierarchical position of a document using the slash "/", the similarity between link information can be calculated by splitting each URL at the slashes and treating the resulting strings like the word vectors described above. In contrast, in diary-style documents such as the non-advertising example of FIG. 5(B), the link information is likely to differ from document to document even for the same issuer, so the similarity between link information is low.
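The slash-splitting idea might look like the following sketch, which treats each URL segment as a "word" and reuses cosine similarity. The function names and the `example.com` / `example.org` URLs are made up for illustration; the patent does not fix a particular vector measure for URLs.

```python
import math
from collections import Counter


def url_tokens(url):
    """Split a URL at slashes and drop the empty pieces left by '//'."""
    return [part for part in url.split("/") if part]


def url_similarity(url_a, url_b):
    """Cosine similarity between the slash-separated parts of two URLs."""
    vec_a, vec_b = Counter(url_tokens(url_a)), Counter(url_tokens(url_b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


product_1 = "http://shop.example.com/items/0001.html"
product_2 = "http://shop.example.com/items/0002.html"
diary_link = "http://blog.example.org/2005/09/trip"
```

Two product pages on the same site differ only in the final segment and therefore score high, whereas a product page and an unrelated blog entry share little beyond the scheme.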

As described so far, the classification rule learning unit 105 stores in memory (not shown) the average similarities calculated for the title, body, and link information as the feature quantities of issuers such as "ABCD" and "EFGH", and repeats the processing until the average similarities have been calculated for all issuers (step 303, Yes).

Next, the classification rule learning unit 105 uses the feature quantities calculated so far for all issuers, together with each issuer's classification (advertisement or non-advertisement) recorded in the learning document recording unit 101, to learn classification rules for classifying issuers whose classification is unknown (step 304), and records the learning result in the classification rule recording unit 106 (step 305).

To learn rules from the feature quantities of multiple issuers and each issuer's known classification, a method such as the decision trees or support vector machines described in "Statistics of Pattern Recognition and Learning" (Aso et al., Iwanami Shoten, pp. 20-23, pp. 107-108) is used. Each of these techniques learns from the feature quantities of issuers whose classification is known, records the resulting classification rules, and then checks the feature quantities calculated for an issuer whose classification is unknown against those rules to assign the issuer to the most likely classification.
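To make the learn-then-apply cycle concrete, here is a deliberately simplified sketch: a depth-1 decision tree (a decision stump) that picks one feature and one threshold. It is a stand-in chosen for brevity, not the procedure from the cited book, and the function names, label strings, and feature values are all invented for this example.

```python
def learn_stump(features, labels, positive="ad"):
    """Pick the (feature index, threshold) pair with the fewest training errors.

    The learned rule predicts `positive` when features[index] >= threshold.
    """
    best = None  # (errors, index, threshold)
    for index in range(len(features[0])):
        for threshold in sorted({row[index] for row in features}):
            errors = sum(
                (row[index] >= threshold) != (label == positive)
                for row, label in zip(features, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, index, threshold)
    return {"index": best[1], "threshold": best[2]}


def classify(rule, feature_row, positive="ad", negative="non-ad"):
    """Apply a learned stump to one issuer's feature quantities."""
    above = feature_row[rule["index"]] >= rule["threshold"]
    return positive if above else negative


# Rows: [avg title similarity, avg body similarity, avg link similarity]
training_features = [
    [0.90, 0.80, 0.70],
    [0.85, 0.60, 0.75],
    [0.20, 0.30, 0.10],
    [0.10, 0.25, 0.05],
]
training_labels = ["ad", "ad", "non-ad", "non-ad"]
rule = learn_stump(training_features, training_labels)
```

A full decision tree or support vector machine would combine several features, but the flow is the same: learn a rule from labeled issuers, store it, then apply it to the feature quantities of an issuer whose classification is unknown.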

最後に、分類結果が未知の発行者の分類を実行するための手順について説明する。 Finally, a procedure for executing classification of an issuer whose classification result is unknown will be described.

FIG. 7 is a flowchart of the procedure for classifying an issuer whose classification is unknown according to the first embodiment of the present invention.

Steps 401 and 402 are identical to steps 301 and 302, except that they use the documents recorded in the classification target document recording unit 102 instead of those in the learning document recording unit 101.

As in FIG. 5, each classification target document stored in the classification target document recording unit 102 has an issuer ID, a document title, and a body text. However, because it is unknown whether it is an advertisement or a non-advertisement, it has no classification attribute.

Once the average similarities for the title, the body text, and the link information have been calculated for an issuer whose classification is unknown, the classification unit 107 refers to the classification rule, that is, the learning result recorded in the classification rule recording unit 106, and assigns the issuer to the more likely class, advertisement or non-advertisement (step 403). The classification result is paired with the issuer ID and output to the classification result storage unit 108 (step 404).

This completes the classification process for one issuer. The process is repeated for all issuers (step 405) and then ends.

FIG. 8 shows an example of classification results in the first embodiment of the present invention.

The classification results obtained above are stored in the classification result storage unit 108 as shown in FIG. 8. Because this embodiment determines whether an issuer is an advertisement or a non-advertisement, one of these two classifications is recorded for each issuer: for example, issuer "PPPP" is an advertisement and issuer "QQQQ" is a non-advertisement.

[Second Embodiment]
FIG. 9 shows the configuration of the document issuer classification apparatus according to the second embodiment of the present invention. Components identical to those in FIG. 3 are given the same reference numerals, and their description is omitted.

In this embodiment, the feature values calculated in steps 101 and 103 shown in FIG. 4 consist of one or more of the average similarities between titles, between body texts, and between link information in the body texts, plus the average number of occurrences of a specific type of keyword in the documents written by the same issuer. The keywords may be, for example, monetary expressions such as "千五百円" (1,500 yen) or "４８ドル" (48 dollars).

The document issuer classification apparatus shown in FIG. 9 adds to the configuration shown in FIG. 3 a keyword totaling unit 601, which calculates, for each issuer, the average number of occurrences of a specific type of keyword in the issuer's documents.

FIG. 10 is a flowchart of the classification rule learning procedure according to the second embodiment of the present invention.

Based on the order in which the order determination unit 103 has sorted the documents lexicographically by title, the similarity calculation unit 104 calculates the average similarity between adjacent documents (steps 701 and 702). This processing is the same as in the first embodiment. In this embodiment, in addition to the average similarities of the title, the body text, and the link information, the keyword totaling unit 601 totals the average number of occurrences of the keyword (step 703). The average number of occurrences of the keyword is the average number of occurrences, per document written by a given issuer, of keywords of a specific type.
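The adjacent-document averaging can be sketched as follows. The similarity measure itself is defined in the first embodiment and is not restated in this passage, so the sketch substitutes a word-level Jaccard similarity purely as an assumption. The function names and sample titles are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (an assumed stand-in measure)."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty texts: treated as dissimilar by convention here
    return len(words_a & words_b) / len(union)


def average_adjacent_similarity(titles: List[str]) -> float:
    """Sort one issuer's titles lexicographically, then average the
    similarity of each pair of neighbouring titles."""
    ordered = sorted(titles)
    pairs = list(zip(ordered, ordered[1:]))
    if not pairs:
        return 0.0  # fewer than two documents: no adjacent pair exists
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


titles = ["cheap watches sale", "cheap watches today", "cheap shoes sale"]
title_similarity = average_adjacent_similarity(titles)
```

Comparing only neighbours in sorted order needs n - 1 comparisons for n documents rather than one comparison per pair of documents.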

This embodiment is also an example in which issuers are classified as advertisement or non-advertisement, and monetary expressions in general are preferably used as the keywords. That is, the keyword totaling unit 601 extracts the monetary expressions from each document in the learning document recording unit 101, regardless of the size of the amount, and counts their occurrences in the text. This is done for every document, and the average number of occurrences of monetary expressions is calculated for each issuer.
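The counting of monetary expressions can be sketched as follows. The text does not specify how monetary expressions are detected, so the regular expression `MONEY` is a deliberately simplified assumption: a run of digits or kanji numerals followed by 円 or ドル, or a currency sign followed by digits. The function name and sample documents are likewise hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Simplified, assumed pattern; a production extractor would need more cases.
MONEY = re.compile(
    r"[$¥￥]\s?[0-9０-９][0-9０-９,，]*"
    r"|[0-9０-９〇一二三四五六七八九十百千万億]"
    r"[0-9０-９,，〇一二三四五六七八九十百千万億]*\s?(?:円|ドル)"
)


def average_money_mentions(docs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return, per issuer ID, the mean number of monetary expressions
    per document. Each item of docs is (issuer ID, body text)."""
    totals: Dict[str, int] = defaultdict(int)
    doc_counts: Dict[str, int] = defaultdict(int)
    for issuer, body in docs:
        totals[issuer] += len(MONEY.findall(body))
        doc_counts[issuer] += 1
    return {issuer: totals[issuer] / doc_counts[issuer] for issuer in doc_counts}


docs = [
    ("PPPP", "特価 千五百円 送料 ４８ドル"),
    ("PPPP", "本日限り $20"),
    ("QQQQ", "今日は散歩をした"),
]
averages = average_money_mentions(docs)
```

Every match counts once regardless of the amount, which mirrors the statement that expressions are collected irrespective of the size of the amount.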

When the keyword totaling unit 601 has finished calculating the average similarities and the average number of occurrences of the keyword (monetary expressions in this embodiment) for all issuers (step 704, Yes), the classification rule learning unit 105 learns the classification rule using both the average similarities and the average number of occurrences (step 705), and records the learned rule in the classification rule recording unit 106 (step 706).
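One way to feed both kinds of feature to the learner is to append the keyword average to each issuer's similarity values. The sketch below shows this; the function name `build_feature_table`, the tuple layout, and the sample numbers are assumptions for illustration, not part of the description.

```python
from typing import Dict, Tuple


def build_feature_table(
    avg_similarity: Dict[str, Tuple[float, ...]],
    avg_keyword_count: Dict[str, float],
) -> Dict[str, Tuple[float, ...]]:
    """Combine, per issuer ID, the average similarities with the average
    keyword count into one feature tuple for rule learning."""
    table: Dict[str, Tuple[float, ...]] = {}
    for issuer, similarities in avg_similarity.items():
        if issuer not in avg_keyword_count:
            raise KeyError(f"no keyword average for issuer {issuer!r}")
        table[issuer] = tuple(similarities) + (avg_keyword_count[issuer],)
    return table


# Hypothetical values: (title, body, link) similarities and keyword averages.
features = build_feature_table(
    {"ABCD": (0.9, 0.8, 0.95), "EFGH": (0.2, 0.1, 0.05)},
    {"ABCD": 2.5, "EFGH": 0.0},
)
```

The same table layout would then be produced for the classification target documents so that learning and classification see identical feature positions.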

FIG. 11 is a flowchart of the procedure for classifying an issuer whose classification is unknown according to the second embodiment of the present invention.

As in the learning procedure shown in FIG. 10, in addition to the calculation of the average similarities by the similarity calculation unit 104 (steps 801 and 802), the keyword totaling unit 601 totals the average number of occurrences of monetary expressions (step 803). Next, the classification unit 107 uses both the average similarities and the average number of occurrences to classify the issuer as advertisement or non-advertisement (step 804), and records the result in the classification result recording unit 108 (step 805).

The operations of the first and second embodiments described above can be implemented as a program, which can be installed and executed on a computer used as the document issuer classification apparatus, or distributed via a network.

The program can also be stored on a hard disk device or on a portable storage medium such as a flexible disk or a CD-ROM, and installed on a computer for execution.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for classifying the issuers of documents.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram showing the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the document issuer classification apparatus according to the first embodiment of the present invention.
FIG. 4 is a flowchart of the outline operation in the first embodiment of the present invention.
FIG. 5 is an example of the learning documents in the learning document recording unit according to the first embodiment of the present invention.
FIG. 6 is a flowchart of the classification rule learning procedure according to the first embodiment of the present invention.
FIG. 7 is a flowchart of the procedure for classifying an issuer whose classification is unknown according to the first embodiment of the present invention.
FIG. 8 is an example of classification results in the first embodiment of the present invention.
FIG. 9 is a configuration diagram of the document issuer classification apparatus according to the second embodiment of the present invention.
FIG. 10 is a flowchart of the classification rule learning procedure according to the second embodiment of the present invention.
FIG. 11 is a flowchart of the procedure for classifying an issuer whose classification is unknown according to the second embodiment of the present invention.

Explanation of symbols

101 Learning document storage means, learning document recording unit
102 Classification target document storage means, classification target document recording unit
103 Order determination unit
104 Similarity calculation means, similarity calculation unit
105 Classification rule learning means, classification rule learning unit
106 Classification rule storage means, classification rule storage unit
107 Classification means, classification unit
108 Classification result storage means, classification result storage unit
601 Keyword totaling unit

Claims

A document issuer classification apparatus for classifying document issuers into two or more types, comprising:
learning document storage means storing a document group for learning a classification rule for document issuers, together with classification information for each document issuer;
classification target document storage means storing a document group written by issuers whose classification is unknown;
similarity calculation means for acquiring, from the learning document storage means, a document group written by the same issuer, calculating one or more of the similarities between titles, between body texts, and between link information in the body texts of two different documents, calculating the average value (average similarity) for each issuer, and using this as a feature value of the issuer;
classification rule learning means for learning a classification rule for classifying issuers whose classification is unknown, using the feature value of each issuer and the known classification information of the issuer acquired from the learning document storage means, and storing the learning result in classification rule storage means; and
classification means for acquiring, from the classification target document storage means, a document group written by the same issuer, calculating the feature value by the same procedure as the similarity calculation means, determining the classification of the issuer based on the classification rule recorded in the classification rule storage means, and storing the classification result in classification result storage means.

The document issuer classification apparatus according to claim 1, further comprising:
keyword counting means for acquiring keywords of a specific type from each document acquired from the learning document storage means and the classification target document storage means, and calculating, for each issuer, the average number of occurrences of the keywords per document,
wherein the classification rule learning means and the classification means include means for learning the classification rule and classifying document issuers with the average number of occurrences added as a further feature value.

The document issuer classification apparatus according to claim 2,
wherein the keyword counting means includes means for extracting monetary expressions from the documents acquired from the learning document storage means and the classification target document storage means, and totaling the average number of occurrences of the monetary expressions per document.

A document issuer classification program that causes a computer, having learning document storage means storing a document group serving as learning material for classifying document issuers, classification target document storage means storing a document group written by issuers whose classification is unknown, classification rule storage means, and classification result storage means, to function as the document issuer classification apparatus according to claim 1.