JPWO2011052025A1 - Data processing apparatus, data processing method, and program

Info

Publication number: JPWO2011052025A1
Application number: JP2011538127A
Authority: JP
Inventors: 秀哉柴田; 守加藤; 光則郡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2009-10-26
Filing date: 2009-10-26
Publication date: 2013-03-14
Anticipated expiration: 2029-10-26
Also published as: JP5220202B2; WO2011052025A1

Abstract

A learning sample data extraction unit 140 extracts sample mails across a plurality of categories. Based on a learning sample selection rule 190, which indicates an upper limit on the total number of sample mails to be selected and the criteria for selecting them, a learning sample selection unit 150 calculates, for each category, a selection count such that the total number of selected sample mails is maximized within the upper limit and the difference between categories is minimized. In accordance with these per-category selection counts, the learning sample selection unit 150 then selects, for each category, the sample mails to be used for learning by a learning unit 160 from among the sample mails extracted by the learning sample data extraction unit 140.

Description

The present invention relates to a technique for classifying data into one of a plurality of categories.

Automatic classification by machine learning is often used as one method for automatically classifying data into a plurality of categories.
In the following, automatic classification by machine learning is described using document data (hereinafter also simply called documents) as an example.

In automatic document classification using machine learning, the features of each category are learned from learning sample documents that have been divided in advance into a plurality of classification categories, and the documents to be classified are classified on the basis of the learning result.
The accuracy of document classification by machine learning therefore depends on the learning sample documents.
Patent Document 1 discloses a technique for raising classification accuracy by experimentally classifying the learning sample documents, identifying and removing the misclassified documents, and thereby improving the classification rules.

Patent Document 1: JP 2002-202984 A

However, the method of Patent Document 1 has the problem that collecting a large number of correctly classified learning sample documents requires manual labor.
In addition, in a system that applies document classification using machine learning, when the time available for machine learning is limited and the user wants to decide the number of learning samples, the method of Patent Document 1 requires the learning samples to be selected manually.

Furthermore, when the number of learning samples differs greatly between classification categories, machine learning may not work well with some algorithms, and classification accuracy falls as a result.

Moreover, when the learning sample documents include many documents with duplicate content, learning all of them is inefficient, and the time available for machine learning cannot be used effectively.
To generate classification rules with high classification accuracy within a limited time, it is desirable to learn as many documents differing in content and format as possible, but the method of Patent Document 1 requires the learning samples to be selected manually.

One of the main objects of the present invention is to solve the problems described above. The invention aims to provide a data classification technique that automatically selects sample data enabling efficient learning without manual labor, and that can raise classification accuracy with a short period of machine learning.

A data processing apparatus according to the present invention comprises:
a classification unit that classifies data into one of a plurality of categories according to a classification rule;
a learning unit that performs learning using sample data and newly generates the classification rule used by the classification unit;
a sample data extraction unit that extracts, for each category, sample data to be used for learning by the learning unit;
a sample data selection criterion information storage unit that stores sample data selection criterion information indicating an upper limit on the total number of sample data to be selected and selection criteria for the sample data; and
a sample data selection unit that, based on the sample data selection criterion information, calculates for each category a selection count such that the total number of selected sample data is maximized within the upper limit and the difference between categories is minimized, and that selects, for each category and in accordance with the per-category selection counts, the sample data to be used for learning by the learning unit from among the sample data extracted by the sample data extraction unit.

According to the present invention, a selection count is calculated for each category such that the total number of selected sample data is maximized within the upper limit and the difference between categories is minimized, and the sample data used for learning is selected for each category from the extracted sample data in accordance with these counts. Classification accuracy in machine learning can therefore be raised without manual labor.

FIG. 1 is a diagram showing an example of the system configuration according to Embodiment 1.
FIG. 2 is a diagram showing a configuration example of the mail archive device according to Embodiment 1.
FIG. 3 is a flowchart showing an operation example of the mail classification device according to Embodiment 1.
FIG. 4 is a flowchart showing an operation example of the learning sample selection unit according to Embodiment 1.
FIG. 5 is a flowchart showing a specific example of the sample mail selection operation according to Embodiment 1.
FIG. 6 is a flowchart showing a specific example of the sample mail selection operation according to Embodiment 1.
FIG. 7 is a diagram showing a configuration example of the document archive device according to Embodiment 2.
FIG. 8 is a diagram showing a hardware configuration example of the mail classification device and the document classification device according to Embodiments 1 and 2.

The following description covers an example of classifying electronic mail (hereinafter also simply called mail) in a mail archive system and an example of classifying document data in a document archive system. The data processing technique according to the present invention is not, however, limited to mail classification in a mail archive system or document data classification in a document archive system; it is applicable to document classification systems in general.

The following also describes a data processing technique in which learning sample selection rules are defined in advance in consideration of the number of learning samples and the document information of each category, so that learning sample documents enabling efficient learning are selected automatically without manual labor and classification accuracy can be raised with a short period of machine learning.

Embodiment 1.
FIG. 1 shows a configuration example of the mail archive system according to the present embodiment.
FIG. 1 shows three organizations, each including a user terminal and a mail server.
Each organization is assigned a domain, which serves as the identifier of that organization.
Each user terminal in an organization is assigned a mail address containing the domain of the organization to which it belongs.
Each mail server receives mail and analyzes the domain contained in the destination address of the received mail to determine where to forward it.

For convenience, the mail domains in FIG. 1 consist of a target organization domain and two non-target organization domains, but the configuration need not be limited to that of FIG. 1; any domain configuration may be used.

The target organization domain is the domain unique to the organization 301, in which the mail archive system of the present embodiment is installed.
The organization 301 may be any organization that has its own domain, such as a company, a public body such as a government ministry or municipal office, another kind of association, or an internal unit of any of these (a business office, a branch office, and so on).
The two non-target organization domains are the domains unique to the organization 302 and the organization 303, neither of which is the organization 301, and they differ from each other.

In the configuration of FIG. 1, the organization 301 includes a mail server 311 and a user terminal 321.
The organization 302 includes a mail server 312 and a user terminal 322.
The organization 303 includes a mail server 313 and a user terminal 323.
The mail servers 311, 312, and 313 of the organizations are connected through a network 330.
The user terminals 321, 322, and 323 can send and receive mail through the mail servers 311, 312, and 313 and the network 330.
The number of user terminals and the mail server configuration are not limited to those of FIG. 1; any number of user terminals and any mail server configuration may be used.

In the configuration of FIG. 1, the mail server 311 is connected to a mail archive device 200.
The mail archive device 200 includes a mail classification device 100.
The mail classification device 100 is an example of a data processing apparatus.

FIG. 2 shows a configuration example of the mail archive device 200 according to the present embodiment.

The mail archive device 200 includes a mail storage database 210 and the mail classification device 100.
The mail archive device 200 copies each newly input mail 201 that passes through the mail server 311 and stores the copy in the mail storage database 210.
The copying of the mail may instead be performed by the mail server 311.
The mail classification device 100 classifies the mail stored in the mail storage database 210 into a plurality of classification categories.
The classification results are stored in the classification result storage database 130.
Each mail stored in the mail storage database 210 is given a mail ID that uniquely identifies it, and this mail ID associates the mail stored in the mail storage database 210 with the corresponding classification result stored in the classification result storage database 130.
By querying the mail storage database 210 and the classification result storage database 130, the system administrator can look up the classification result of a mail and search for mail using the classification result as a key.
If the mail classification device 100 is started at every interval set by the system administrator and the mail stored in the mail storage database 210 during that interval is taken as the classification target, the system can be operated so that every mail in the continuous input stream is classified.

The mail classification device 100 consists of a classification target data extraction unit 110, a classification unit 120, the classification result storage database 130, a learning sample data extraction unit 140, a learning sample selection unit 150, a learning unit 160, a classification rule storage unit 170, and a condition rule storage unit 195.
The condition rule storage unit 195 stores learning sample extraction conditions 180 for the plurality of classification categories and a learning sample selection rule 190.
The learning sample selection rule 190 includes a sample total number upper limit determination rule 191, a sample number determination rule 192, and a sample selection rule 193.
There is no particular restriction on the number of classification categories; any natural number of 2 or more can be set.
The learning sample extraction conditions 180 and the learning sample selection rule 190 are set by the system administrator or the like.

The classification target data extraction unit 110 issues a query to the mail storage database 210 and extracts the mail matching the query from the mail storage database 210 as classification target mail.

Using the classification rule generated by the learning unit 160 (and stored in the classification rule storage unit 170), the classification unit 120 classifies each classification target mail extracted by the classification target data extraction unit 110 into one of the plurality of classification categories, and stores the classification result in the classification result storage database 130 in association with the mail ID.

From the mail that has already been stored in the mail storage database 210 and already classified, the learning sample data extraction unit 140 extracts, for each category, mail (sample data) that is a candidate learning sample for the learning unit 160, as learning sample mail.
In the following, learning sample mail is also called sample mail. Sample mail that has been selected for learning by the learning unit 160 is called a learning sample.
The mail extracted as learning sample mail for a given classification category is mail that satisfies the learning sample extraction condition 180 of that category.
Here, instead of retrieving the learning sample mail afresh from the mail storage database 210, the mail already extracted by the classification target data extraction unit 110 can be reused, which reduces the time spent extracting mail from the mail storage database 210.
The learning sample data extraction unit 140 is an example of a sample data extraction unit.

The learning sample extraction condition 180 can include, for example, a search expression written as a regular expression.
Regular expressions make it possible to search for more complex patterns in addition to simple keywords, which makes the learning sample extraction condition 180 more flexible.
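
As a minimal sketch of how a regular-expression search could serve as a learning sample extraction condition 180, the Python fragment below maps each category to one pattern and collects the IDs of the mails whose text matches. The category names, the patterns, and the `extract_samples` helper are hypothetical illustrations and are not taken from the patent.

```python
import re

# Hypothetical extraction conditions: one regular expression per category.
EXTRACTION_CONDITIONS = {
    "estimate": re.compile(r"(quotation|estimate)\s+no\.?\s*\d+", re.IGNORECASE),
    "meeting": re.compile(r"\b(agenda|minutes)\b", re.IGNORECASE),
}


def extract_samples(mails, conditions=EXTRACTION_CONDITIONS):
    """Return {category: [mail_id, ...]} for the mails matching each condition.

    `mails` is an iterable of (mail_id, text) pairs.
    """
    samples = {category: [] for category in conditions}
    for mail_id, text in mails:
        for category, pattern in conditions.items():
            if pattern.search(text):
                samples[category].append(mail_id)
    return samples
```

A plain keyword condition is the special case of a pattern with no metacharacters, so both kinds of condition can share one code path.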

The learning sample extraction condition 180 used by the learning sample data extraction unit 140 may also extract and match attributes of the mail.
Such attributes include the header fields defined in Request for Comments (RFC) 2822, the envelope defined in RFC 2821, and header fields defined independently by individual mail server systems.
Examples of header fields include From, To, and Cc (the sender, recipient, and carbon-copy recipient addresses), Subject (the subject line), Date (the date and time of sending), and Received (the date and time of receipt).

The learning sample extraction condition 180 used by the learning sample data extraction unit 140 may also extract and match the file name of a mail attachment or the text inside the attachment.
For an attachment encoded in a format such as Multipurpose Internet Mail Extensions (MIME), defined in RFC 2045 to 2049, the attachment file name can be extracted from the MIME headers, and the body can be decoded to extract the attachment and then the text it contains.
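
The header and attachment attributes described above can be read with Python's standard `email` package. The sketch below is one possible way to collect them for matching; the `mail_attributes` helper and the shape of the returned dictionary are assumptions made for illustration, and extracting text from the decoded attachment bytes is left out because it depends on the file format.

```python
from email import message_from_string
from email.policy import default


def mail_attributes(raw_mail: str) -> dict:
    """Collect header fields and decoded attachments from one raw mail."""
    # policy=default makes the parser return an EmailMessage,
    # which provides iter_attachments().
    msg = message_from_string(raw_mail, policy=default)
    attachments = [
        # get_filename() reads the name from the MIME headers;
        # get_payload(decode=True) undoes base64 or quoted-printable encoding.
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return {
        "from": msg["From"],
        "to": msg["To"],
        "cc": msg["Cc"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "attachments": attachments,
    }
```

A header that is absent from the mail, such as Cc in many messages, comes back as `None`, so a condition that matches on it should treat `None` as a non-match.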

In accordance with the learning sample selection rule 190, the learning sample selection unit 150 selects the learning samples that the learning unit 160 actually uses for machine learning.

The learning sample selection rule 190 consists of three rules: the sample total number upper limit determination rule 191, the sample number determination rule 192, and the sample selection rule 193.
The sample total number upper limit determination rule 191 indicates the upper limit on the total number of learning samples across all classification categories used for machine learning (the total number of sample data selected).
The sample number determination rule 192 is the rule for determining, for each classification category, the number of sample mails used for machine learning, starting from the upper limit determined by the sample total number upper limit determination rule 191.
The sample selection rule 193 is the rule for selecting the sample mails of each classification category so that their number equals the per-category count determined by the sample number determination rule 192.
Together, the rules 191, 192, and 193 thus indicate the upper limit on the total number of learning samples selected and the criteria for selecting sample mails, and they correspond to an example of sample data selection criterion information.
The condition rule storage unit 195 stores these rules and is therefore an example of a sample data selection criterion information storage unit.

Based on the learning sample selection rule 190, the learning sample selection unit 150 calculates, for each category, a selection count such that the total number of selected sample mails is maximized within the upper limit and the difference between categories is minimized. In accordance with these per-category counts, it then selects, for each category, the learning samples used for learning by the learning unit 160 from among the sample mails extracted by the learning sample data extraction unit 140.
Here, "the difference between categories is minimized" means, when there are two categories, that the difference between the selection counts of the two categories is minimized.
When there are three or more categories, it means that the difference between the category with the largest selection count and the category with the smallest selection count is minimized.
The learning sample selection unit 150 is an example of a sample data selection unit.
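
One way to compute per-category counts that meet both conditions (largest possible total within the upper limit, smallest possible gap between the largest and smallest count) is to hand out an even share of the remaining budget, starting with the scarcest category. The sketch below illustrates that idea only; the function name `balanced_counts` and its dictionary interface are assumptions, and the patent does not prescribe this particular procedure.

```python
def balanced_counts(available, upper_limit):
    """Return {category: selection count}.

    `available` maps each category to the number of extracted sample mails;
    `upper_limit` is the upper limit on the total number of selected samples.
    """
    # The total can never exceed what was actually extracted.
    remaining = min(upper_limit, sum(available.values()))
    counts = {}
    # Visit categories from the scarcest to the most plentiful, so that a
    # category with too few mails passes its unused share on to the others.
    pending = sorted(available, key=available.get)
    while pending:
        share = remaining // len(pending)  # even share of what is left
        category = pending.pop(0)
        counts[category] = min(available[category], share)
        remaining -= counts[category]
    return counts
```

With 10 mails in category A, 100 in category B, and an upper limit of 60, the sketch takes all 10 mails of A and 50 of B; with 100 mails in each category it takes 30 of each.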

The learning unit 160 receives the mail extracted and selected for each category by the learning sample data extraction unit 140 and the learning sample selection unit 150 as the learning samples of the respective categories, and uses them to generate the classification rule used by the classification unit 120.

The classification unit 120 and the learning unit 160 can use any generally known document classification method based on machine learning.
A combination of several such methods can also be used.

Next, the operation of the mail classification device 100 is described with reference to FIG. 3.

The mail classification device 100 is started in accordance with the start-up interval set in advance by the system administrator or the like (S101).
The classification target data extraction unit 110 extracts, as classification target mail, the mail stored in the mail storage database 210 during the one interval between the previous start-up of the mail classification device 100 and the current one (S102).
The classification unit 120 classifies each mail extracted as classification target mail into one of the classification categories according to the classification rule and stores the classification result in the classification result storage database 130 (S103) (classification processing).
The learning sample data extraction unit 140 extracts learning sample mail using the learning sample extraction condition 180 set for each classification category (S104) (sample data extraction processing).
The learning sample selection unit 150 reads the learning sample selection rule 190 from the condition rule storage unit 195 (read processing) and selects, for each classification category, the learning samples actually used for machine learning from the learning sample mail (S105) (sample data selection processing).
The learning unit 160 learns the learning samples extracted and selected for each classification category and generates or updates the classification rule (S106) (learning processing).
This series of operations is repeated at every start-up interval of the mail classification device 100.

The operation of the learning sample selection unit 150 (S105) is described in more detail with reference to FIG. 4.

First, the learning sample selection unit 150 uses the sample total number upper limit determination rule 191 to determine the upper limit on the total number of learning samples across all classification categories (S201).
Next, using the sample number determination rule 192, it determines the number of sample mails per classification category to use for machine learning, starting from the upper limit determined in S201 (S202).
Finally, using the sample selection rule 193, it decides which sample mails to select in each classification category so that their number equals the per-category count determined in S202 (S203).

The sample total number upper limit determination rule 191 is defined by the system administrator or the like in consideration of the time that the mail classification device 100 can allot to machine learning. Specific examples follow.

When the time that the mail classification device 100 allots to one round of machine learning is to be fixed, the number of mails that can be learned is obtained from the average learning time per mail and the total time that can be allotted to machine learning.
Giving this number as a constant upper limit on the total number of learning samples therefore ensures that machine learning always finishes within the allotted time.
Because the mail archive device 200 starts at fixed intervals, the number of newly input mails 201 may, depending on the time of start-up, be smaller than the constant given as the upper limit on the total number of learning samples.
In that case, the number of newly input mails 201 is simply used as the upper limit instead.
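
The fixed-time calculation above amounts to a division followed by a cap. The following sketch shows the arithmetic under assumed inputs; the function name, the parameter names, and the use of seconds as the time unit are illustrative choices, not part of the patent.

```python
def sample_upper_limit(time_budget_s, avg_time_per_mail_s, new_mail_count):
    """Upper limit on the total number of learning samples for one start-up.

    time_budget_s:       total time that can be allotted to machine learning
    avg_time_per_mail_s: average learning time per mail
    new_mail_count:      number of newly input mails in this interval
    """
    # Number of mails that can be learned within the time budget.
    by_time = int(time_budget_s // avg_time_per_mail_s)
    # If fewer mails arrived than the constant allows, use their count instead.
    return min(by_time, new_mail_count)
```

For example, a budget of 600 seconds at 0.5 seconds per mail allows 1200 mails, so an interval with 5000 new mails is capped at 1200, while an interval with only 800 new mails is capped at 800.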

Further, the upper limit of the total number of learning samples may be determined as a ratio to the number of new input mails 201.
In other words, the sample total number upper limit determination rule 191 may define a rule in which the value obtained by multiplying the number of classification target mails input in each cycle by a predetermined ratio is used as the sample total number upper limit. Based on this rule, the learning sample selection unit 150 calculates, for each cycle, the value obtained by multiplying the number of input classification target mails by the predetermined ratio as the sample total number upper limit, and selects learning samples using the calculated upper limit.
In this way, the upper limit of the total number of learning samples is not given as a constant but can be varied according to the number of new input mails 201.
The number of new input mails 201 input to the mail classification device 100 varies from one activation cycle to another. However, when the total number of mails input over a day is fixed to some extent, setting the upper limit as described above makes the time that the system allots to machine learning over a day substantially constant.
In addition, setting the upper limit in this way makes it possible to select learning samples evenly, without bias, from the mails input to the system throughout the day.
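The two ways of determining the upper limit described above (a constant, replaced by the number of new input mails when fewer mails arrived, or a ratio of the number of new input mails) can be sketched as follows. This is only an illustrative sketch: the function name, the parameter names, and the rounding down of the ratio result are assumptions that do not appear in the description.

```python
import math


def sample_total_upper_limit(num_new_mails, constant=None, ratio=None):
    """Return the upper limit c on the total number of learning samples.

    Exactly one of `constant` or `ratio` must be given (hypothetical parameters).
    """
    if (constant is None) == (ratio is None):
        raise ValueError("specify exactly one of constant or ratio")
    if constant is not None:
        # Constant rule: if fewer mails arrived in this cycle than the
        # constant, the number of new input mails becomes the upper limit.
        return min(constant, num_new_mails)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Ratio rule: rounding down is an assumption of this sketch.
    return math.floor(num_new_mails * ratio)
```

With either rule the returned value never exceeds the number of new input mails, which is the precondition c ≦ a + b used by the sample number determination rules.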

The sample number determination rule 192 is defined by a system administrator or the like according to the characteristics of the system. Specific examples are given below.
The following description uses two classification categories, category A and category B, but similar rules can be defined when there are three or more categories.

Let a be the number of category A mails extracted by the learning sample data extraction unit 140, and let b be the number of category B mails.
Let c (≦ a + b) be the upper limit given by the sample total number upper limit determination rule 191.
In this case, a sample number determination rule 192 indicating the following criteria is provided, and the learning sample selection unit 150 uses these criteria to calculate the selection number a' of category A sample mails and the selection number b' of category B sample mails.

1) When a < c/2:
a' = a
b' = c - a
2) When b < c/2:
a' = c - b
b' = b
3) In cases other than 1) and 2) above:
a' = c/2
b' = c/2

In this way, under the condition that the total number of learning samples equals the upper limit c, the numbers of sample mails can be set so that the difference between the number of category A learning samples (a') and the number of category B learning samples (b') is minimized.
In general, document classification using machine learning is known to become more accurate as the difference in the number of learning samples between classification categories becomes smaller, so the above rule can improve the accuracy of mail classification.
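The rule above can be sketched in a few lines. The sketch is written under two assumptions that are not stated in the description: case 2) is treated symmetrically to case 1) (a' = c - b, b' = b), since only b category B mails are available, and an odd c in case 3) is split so that the two counts differ by one. The function name is hypothetical.

```python
def balanced_counts(a, b, c):
    """Return (a', b') with a' + b' == c and |a' - b'| as small as possible.

    a, b: numbers of extracted category A and category B mails.
    c:    upper limit on the total number of learning samples (c <= a + b).
    """
    if c > a + b:
        raise ValueError("c must not exceed a + b")
    if a < c / 2:
        # Case 1: category A is scarce, so use all of it.
        return a, c - a
    if b < c / 2:
        # Case 2: category B is scarce (symmetric to case 1; an assumption).
        return c - b, b
    # Case 3: both categories can supply half of c.
    # For odd c the extra sample goes to category B (an assumption).
    return c // 2, c - c // 2
```

For example, with a = 10, b = 100, and c = 50 the sketch uses all 10 category A mails and fills the remaining 40 from category B.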

As described above, when the upper limit of the total number of learning samples is a constant, the number of new input mails 201 (a + b) may fall short of that constant. In this case, as described above, the upper limit (c) is changed to the number of new input mails 201 (c = a + b).
When the upper limit is defined as a ratio to the number of new input mails 201, c ≦ a + b always holds.
Therefore, the relationship c ≦ a + b always holds among a, b, and c, and the above rule can be applied.
Thus, when the total number of sample mails extracted by the learning sample data extraction unit 140 is less than the upper limit (constant) on the total number of sample mails, the selection numbers are calculated for each category so that the total number of selected sample mails is maximized within the extracted total and the difference between categories is minimized.

Alternatively, a sample number determination rule 192 indicating the following criteria may be provided, and the learning sample selection unit 150 may use these criteria to calculate the selection number a' of category A sample mails and the selection number b' of category B sample mails.

1) When a < c/2:
a' = a
b' = a
2) When b < c/2:
a' = b
b' = b
3) In cases other than 1) and 2) above:
a' = c/2
b' = c/2

This rule brings the total number of learning samples as close as possible to the upper limit c under the condition that the number of category A learning samples always equals the number of category B learning samples.
This rule is effective when sufficient learning samples are available for both category A and category B.
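Because c ≦ a + b, the three cases of the equal-count rule collapse into taking the smallest of a, b, and c/2 for both categories. A minimal sketch, assuming integer counts with c/2 rounded down (the function name is hypothetical):

```python
def equal_counts(a, b, c):
    """Return (a', b') with a' == b' and a' + b' as close to c as possible."""
    if c > a + b:
        raise ValueError("c must not exceed a + b")
    # Case 1 yields a, case 2 yields b, case 3 yields c/2;
    # in each case that value is the smallest of the three.
    n = min(a, b, c // 2)
    return n, n
```

When one category is scarce, the total 2n can fall well below c, which is why the rule suits situations where both categories have plenty of samples.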

a' = (a * c) / (a + b)
b' = (b * c) / (a + b)

This rule preserves the ratio between the original numbers of learning samples a and b while making the total number of learning samples equal to the upper limit c.
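The proportional rule generally produces fractional values, so an implementation has to round. The sketch below rounds a' down and assigns the remainder to b', which keeps the total exactly equal to c and never asks for more category B mails than the b available when c ≦ a + b. The rounding scheme and the function name are assumptions, not part of the description.

```python
def proportional_counts(a, b, c):
    """Return (a', b') that keep roughly the ratio a : b and sum to c."""
    if c > a + b:
        raise ValueError("c must not exceed a + b")
    if a + b == 0:
        return 0, 0
    a_selected = (a * c) // (a + b)  # a' = a*c/(a+b), rounded down
    return a_selected, c - a_selected  # b' takes the remainder
```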

The sample selection rule 193 is defined by the system administrator or the like according to the characteristics of the system.
For example, a rule can be defined that randomly selects sample mails for each classification category from the mails extracted by the learning sample data extraction unit 140 so that the number of samples for each classification category matches the number determined by the sample number determination rule 192. Alternatively, when transmission date and time information is provided in the sample mails, a rule can be defined that selects sample mails in order from the newest transmission date and time.
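The two simple selection rules mentioned here, random selection and newest-first selection, might look like the following for a single category. The representation of a mail and the `sent_at` accessor are hypothetical; the description does not specify a data structure.

```python
import random
from datetime import datetime


def select_random(mails, n, seed=None):
    """Randomly pick up to n mails from one category."""
    rng = random.Random(seed)
    return rng.sample(mails, min(n, len(mails)))


def select_newest(mails, n, sent_at):
    """Pick up to n mails in order from the newest transmission date and time.

    sent_at: function returning the transmission datetime of a mail
             (hypothetical accessor).
    """
    return sorted(mails, key=sent_at, reverse=True)[:n]


# Usage with mails represented as dictionaries (an assumption of this sketch).
mails = [{"id": i, "sent": datetime(2009, 10, i + 1)} for i in range(5)]
newest_two = select_newest(mails, 2, lambda m: m["sent"])
```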

Preferably, in each classification category, sample mails may be selected under the condition that the number of times sample mails having the same attribute (for example, the mail subject (title)) are selected is minimized and the variance of the number of sample mails having each attribute is minimized.
That is, the learning sample data extraction unit 140 extracts, for one category, a plurality of sample mails having a plurality of kinds of attributes (mail subjects (titles)), and the learning sample selection unit 150 selects learning sample mails for each category so that, within the specified selection number, the number of attribute kinds among the sample mails is maximized and the difference in the number of selected sample mails between attributes is minimized.
Here, "the difference in the number of selected sample mails between attributes is minimized" means, when there are two kinds of attributes, that the difference between the two attributes is minimized.
When there are three or more kinds of attributes, it means that the difference between the attribute with the largest selection number and the attribute with the smallest selection number is minimized.
A specific operation example in which the mail subject is used as the attribute will be described with reference to FIG. 5.

FIG. 5 illustrates the operation of selecting N mails from a certain category.
Let M denote the group of mails extracted as samples of this category by the learning sample data extraction unit 140.
First, the learning sample selection unit 150 selects the maximum number of mails from M such that no subject is selected more than once (S301).
If the number of selected mails is N or more (S302), the learning sample selection unit 150 randomly selects N of them and ends (S303).
If the number of selected mails is smaller than N (S302), the learning sample selection unit 150 confirms the selected mails as selected (S304), removes the confirmed mails from M (S305), and returns to S301.
The learning sample selection unit 150 repeats this process until the number of confirmed mails reaches N.

As described above, for each category, the learning sample selection unit 150 selects one sample mail per attribute for all attributes (subjects) from the plurality of sample mails. If the number of selected sample mails is equal to or greater than the selection number specified for the category (N in the above example), it randomly selects, from the selected sample mails, the sample mails to be used for learning so that their number matches the selection number (S303).
On the other hand, if the number of selected sample mails is less than the selection number specified for the category, the learning sample selection unit 150 selects the selected sample mails as sample mails to be used for learning (S304), and then makes up the shortfall by selecting, from the unselected sample mails, one sample mail per attribute for all attributes contained in the unselected sample mails.

Since mails with the same subject are likely to have very similar contents, the above selection rule eliminates learning samples with duplicate contents, and as a result the mails selected as learning samples can cover a wide variety of topics.
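The procedure of FIG. 5 can be sketched as a loop that takes one mail per distinct subject in each round. The sketch makes three assumptions beyond the description: in later rounds the comparison in S302 is made against the number of mails still needed, the mail taken for each subject is chosen at random, and the loop stops early if M runs out before N mails are confirmed. All names are hypothetical.

```python
import random


def select_diverse(mails, n, key, seed=None):
    """Select up to n mails so that attribute values repeat as little as possible.

    key: function returning the attribute of a mail (for example its subject).
    """
    rng = random.Random(seed)
    pool = list(range(len(mails)))  # indices of the mails still in M
    confirmed = []                  # indices of confirmed mails
    while pool and len(confirmed) < n:
        # S301: choose one mail for every distinct attribute value in M.
        groups = {}
        for i in pool:
            groups.setdefault(key(mails[i]), []).append(i)
        picked = [rng.choice(group) for group in groups.values()]
        remaining = n - len(confirmed)
        if len(picked) >= remaining:
            # S302/S303: enough mails, so cut down at random and finish.
            confirmed.extend(rng.sample(picked, remaining))
            break
        # S304/S305: confirm the picked mails and remove them from M.
        confirmed.extend(picked)
        taken = set(picked)
        pool = [i for i in pool if i not in taken]
    return [mails[i] for i in confirmed]
```

Passing a different `key` function, such as one returning an attachment extension, a sender domain, or a character code, applies the same loop to other attributes.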

Alternatively, the extension of a file attached to a sample mail may be used as the attribute.
That is, sample mails may be selected so that the number of kinds of extensions (kinds of application programs) among the attached files of the selected sample mails is maximized and the variance of the number of sample mails with attached files having each extension is minimized.
Since files with the same extension are likely to have similar document formats, the above selection rule allows the mails selected as learning samples to include attached files in a variety of formats.

Alternatively, when the sample mails to be selected are received mails, the domain of the mail address in the From header field may be used as the attribute.
That is, sample mails may be selected so that the number of kinds of domains of the mail addresses in the From header field is maximized and the variance of the number of sample mails whose address has each domain is minimized.
When the sample mails to be selected are sent mails, the domain of the mail address in the To header field may be used as the attribute.
That is, sample mails may be selected so that the number of kinds of domains of the mail addresses in the To header field is maximized and the variance of the number of sample mails whose address has each domain is minimized.
Since mails with the same domain are likely to have similar contents, the above selection rule allows the mails selected as learning samples to cover a wide variety of topics.

Alternatively, the kind of character code given in the charset parameter of the Content-Type header field of a sample mail may be used as the attribute.
That is, sample mails may be selected so that the number of kinds of character codes given in the charset parameter of the Content-Type header field of the selected sample mails is maximized and the variance of the number of sample mails with each character code is minimized.
Since mails with the same character code are likely to have been written in the same language area, the above selection rule allows the mails selected as learning samples to include a variety of languages.
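The attributes discussed above (subject, attachment extension, sender or recipient domain, and charset) can all be read from a raw mail with Python's standard `email` package. The following is only a sketch: the function name and the returned dictionary layout are hypothetical, and for multipart mails the top-level charset may be absent.

```python
import os
from email import message_from_string
from email.utils import parseaddr


def mail_attributes(raw, received=True):
    """Extract the attributes that the selection rules may use from a raw mail."""
    msg = message_from_string(raw)
    # Received mails use the From field, sent mails the To field.
    address = parseaddr(msg.get("From" if received else "To", ""))[1]
    domain = address.rpartition("@")[2].lower() if "@" in address else ""
    # Collect the extensions of all attached files, if any.
    extensions = sorted({
        os.path.splitext(part.get_filename())[1].lower()
        for part in msg.walk()
        if part.get_filename()
    })
    return {
        "subject": msg.get("Subject", ""),
        "domain": domain,
        "charset": msg.get_content_charset(),  # lower-cased, or None if absent
        "extensions": extensions,
    }
```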

The sample selection rules listed above may be combined with an order of priority.
For example, the operation when the highest priority is given to selecting sample mails with diverse subjects, the second priority to diverse domains, and the third priority to diverse character codes will be described with reference to FIG. 6.

First, the learning sample selection unit 150 repeats the operations of S301, S302, S304, and S305 in FIG. 5 and lists candidate selections of sample mails that include mails with diverse subjects (S401).
When there is only one candidate (S402), the learning sample selection unit 150 confirms that selection and ends.
When there are a plurality of candidates (S402), the learning sample selection unit 150 narrows down the candidates so that diverse domains are included (S403).
When a plurality of candidates still remain (S404), the learning sample selection unit 150 narrows down the candidates so that diverse character codes are included (S405).
When a plurality of candidates still remain (S406), the learning sample selection unit 150 randomly selects one candidate and ends (S407).
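The narrowing steps S402 to S407 can be sketched as successive filters over candidate selections. The sketch assumes that each candidate is a list of mails represented as dictionaries and that "diverse" is measured simply by the number of distinct attribute values in a candidate; the description does not fix either point, and all names are hypothetical.

```python
import random


def narrow(candidates, attribute):
    """Keep the candidates whose mails cover the most distinct values of attribute."""
    def diversity(selection):
        return len({mail[attribute] for mail in selection})

    best = max(diversity(c) for c in candidates)
    return [c for c in candidates if diversity(c) == best]


def choose_selection(candidates, priorities=("domain", "charset"), seed=None):
    """Pick one candidate selection, applying the attributes in priority order."""
    rng = random.Random(seed)
    remaining = list(candidates)  # candidates from S401 (must be non-empty)
    for attribute in priorities:
        if len(remaining) == 1:   # S402/S404/S406: a single candidate is final
            break
        remaining = narrow(remaining, attribute)  # S403/S405
    return rng.choice(remaining)  # S407: random choice among the rest
```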

More preferably, small mails whose text size is equal to or less than a value predetermined by the system administrator or the like may be excluded from selection as learning samples.
For example, to avoid selecting mails whose text contains 10 characters or fewer, if the text is encoded in UTF-8 (8-bit UCS Transformation Format), it suffices to assume an average of 3 bytes per character and exclude mails of 30 bytes or less.
A small text file has few features, so depending on the algorithm used, machine learning may not work well and classification accuracy may drop as a result.
Therefore, classification accuracy can be improved by excluding small mails from the learning samples.
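The size filter from the UTF-8 example can be sketched as follows. The 30-byte threshold follows the example above; the function and parameter names are hypothetical, and the mails are assumed to be given as plain text strings.

```python
def exclude_short_mails(texts, max_excluded_bytes=30, encoding="utf-8"):
    """Drop mail texts whose encoded size is max_excluded_bytes or less."""
    return [t for t in texts if len(t.encode(encoding)) > max_excluded_bytes]
```

A five-character Japanese text occupies 15 bytes in UTF-8 and is dropped, while an eleven-character one occupies 33 bytes and is kept. ASCII text uses one byte per character, so the 3-bytes-per-character estimate is only an average.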

As described above, in the first embodiment, the learning samples used by the mail classification device 100 are automatically selected according to the preset learning sample selection rule 190. This makes it possible to select learning samples that allow efficient learning without the labor of creating learning samples manually, and as a result a document classification device can be provided that improves classification accuracy with short-time machine learning.

Embodiment 2.
FIG. 7 is a configuration diagram showing a document archive system to which the document classification device 100b according to the present embodiment is applied.
The difference between the document archive device 200b and the mail archive device 200 is whether the input is a document file or an e-mail.
That is, as shown in FIG. 7, the configurations of the document classification device 100b and the document archive device 200b are substantially the same as those of the mail classification device 100 and the mail archive device 200 shown in FIG. 1.
Hereinafter, the operation of the document archive device 200b will be described, focusing on the differences from the mail archive device 200.

The document archive device 200b includes a document storage database 210b and a document classification device 100b.
The document archive device 200b duplicates each new input document 201b and stores it in the document storage database 210b.
Each document stored in the document storage database 210b is assigned a document ID that uniquely identifies the document, and this document ID associates the document stored in the document storage database 210b with the document classification result stored in the classification result storage database 130.
By querying the document storage database 210b and the classification result storage database 130, the system administrator can refer to the classification results of documents and search for documents using the classification result as a key.
By activating the document classification device 100b at each cycle set by the system administrator and classifying the documents stored in the document storage database 210b during that activation cycle, the system can be operated so that classification is performed on all continuously input documents.

The document storage database 210b may hold additional information associated with each stored document, such as the document file name, its extension, the document creation date and time, and the creator.
This makes it possible to use this additional information in the learning sample selection rule 190.

The following describes how to set the sample selection rule 193 among the learning sample selection rules 190. The other rules are the same as in the mail archive system.

The sample selection rule 193 is defined by the system administrator or the like according to the characteristics of the system.
For example, a rule can be defined that randomly selects sample documents for each classification category from the documents extracted by the learning sample data extraction unit 140 so that the number of samples for each classification category matches the number determined by the sample number determination rule 192. Alternatively, when document creation date and time information is provided in the sample documents, a rule can be defined that selects sample documents in order from the newest creation date and time.

As in the mail archive system, sample documents may also be selected based on their attributes.
Specifically, using the file name (title) of a sample document as the attribute, sample documents may be selected in each classification category under the condition that the number of times sample documents having the same file name are selected is minimized and the variance of the number of sample documents for each file name is minimized.
Since documents with the same file name are likely to have very similar contents, the above selection rule eliminates learning samples with duplicate contents, and as a result the documents selected as learning samples can cover a wide variety of topics.

Alternatively, sample documents may be selected so that the number of kinds of extensions (kinds of application programs) among the selected sample documents is maximized and the variance of the number of sample documents having each extension is minimized.
Since files with the same extension are likely to have similar document formats, the above selection rule allows the documents selected as learning samples to include files in a variety of formats.

Alternatively, sample documents may be selected so that the number of creators of the selected sample documents is maximized and the variance of the number of sample documents created by each creator is minimized.
Since documents created by the same person are likely to deal with similar topics, the above selection rule allows the documents selected as learning samples to cover a wide variety of topics.

The sample document selection rules listed above may be combined with an order of priority.
For example, a rule can be set that gives the highest priority to selecting sample documents with diverse file names and the second priority to selecting sample documents with diverse extensions.

As described above, in the second embodiment, the learning samples used by the document classification device 100b are automatically selected according to the preset learning sample selection rule 190. This makes it possible to select learning samples that allow efficient learning without the labor of creating learning samples manually, and as a result a document classification device can be provided that improves classification accuracy with short-time machine learning.

As described above, Embodiments 1 and 2 have described a document classification device that classifies input documents into a plurality of classification categories, the device comprising:
means for inputting a plurality of sample documents;
means for extracting, according to sample extraction conditions set in advance for each classification category, sample documents that match a sample extraction condition as sample documents of the corresponding classification category;
means for selecting, from the extracted sample documents, learning sample documents to be used for each classification category according to a preset learning sample selection rule;
means for generating or updating a classification rule by performing machine learning with at least one algorithm using the learning sample documents selected for each classification category;
means for inputting one or more classification target documents;
means for classifying the input classification target documents into the plurality of classification categories using the classification rule; and
means for outputting the classification results of the classification target documents.

Embodiments 1 and 2 have also described that
the learning sample selection rule consists of three rules:
a sample total number upper limit determination rule that determines the upper limit of the total number of learning samples over all classification categories used for machine learning;
a sample number determination rule that determines the number of samples for each classification category used for machine learning, based on the upper limit determined by the sample total number upper limit determination rule; and
a sample selection rule that selects sample documents for each classification category so that the number of samples for each classification category matches the number determined by the sample number determination rule;
and that the means for selecting the learning sample documents consists of:
means for determining, according to the sample total number upper limit determination rule, the upper limit of the total number of learning samples over all classification categories used for machine learning;
means for determining, according to the sample number determination rule, the number of samples for each classification category used for machine learning; and
means for selecting, according to the sample selection rule, the sample documents extracted for each classification category.

Embodiments 1 and 2 have also described that
the sample total number upper limit determination rule is a rule that
specifies the upper limit of the total number of learning samples over all classification categories used for machine learning as a constant, and,
only when the number of samples input by the means for inputting the plurality of sample documents is less than that constant, specifies the number of input samples as the upper limit of the total number of learning samples.

Embodiments 1 and 2 have also described that
the sample total number upper limit determination rule is a rule that
determines the upper limit of the total number of learning samples over all classification categories used for machine learning as the value obtained by multiplying the number of samples input by the means for inputting the plurality of sample documents by a predetermined sample document usage rate.

Embodiments 1 and 2 have also described that
the sample number determination rule is a rule that
determines the number of samples for each classification category, under the condition that the total number of samples over the classification categories equals the upper limit of the total number of learning samples, so that the difference in the number of samples between the category with the most sample documents and the category with the fewest is minimized and the variance of the number of samples across the classification categories is minimized.
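For three or more categories, one possible way to meet this condition is a "water-filling" allocation: categories are processed from the scarcest to the most plentiful, and each receives either everything it has or an equal share of what is still to be allocated. The description does not give an algorithm for the multi-category case, so the following is only a sketch with hypothetical names.

```python
def balanced_counts_multi(available, c):
    """Allocate c samples over the categories as evenly as availability allows.

    available: number of extracted sample documents in each category.
    c:         upper limit on the total number of learning samples.
    """
    if c > sum(available):
        raise ValueError("c must not exceed the total number of samples")
    counts = [0] * len(available)
    remaining = c
    # Visit the categories from the fewest available samples to the most.
    order = sorted(range(len(available)), key=lambda i: available[i])
    for position, i in enumerate(order):
        share = remaining // (len(order) - position)  # equal share of the rest
        counts[i] = min(available[i], share)
        remaining -= counts[i]
    return counts
```

With two categories this gives the same results as the two-category rule, for example `balanced_counts_multi([10, 100], 50)` yields 10 and 40.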

Embodiments 1 and 2 have also described that
the sample number determination rule is a rule that
determines the number of samples for each classification category, under the condition that the numbers of samples for all classification categories are equal, so that the total number of samples over the classification categories comes closest to the upper limit of the total number of learning samples.

Embodiments 1 and 2 have also described that
the sample number determination rule is a rule that
determines the number of samples for each classification category, under the condition that the total number of samples over the classification categories equals the upper limit of the total number of learning samples, so that the ratio among the classification categories of the sample documents input by the means for inputting the plurality of sample documents equals the ratio among the classification categories of the sample documents extracted as sample documents of each classification category.

In Embodiments 1 and 2, the sample selection rule has also been described as a rule that selects the sample documents of each classification category at random.

In Embodiments 1 and 2, it has also been explained that the document classification device holds information on the creation date and time of each input sample document, and that the sample selection rule is a rule that picks sample documents in order from the most recent creation date and time.
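
The newest-first rule can be sketched as a sort on the stored creation timestamp. The `Sample` record and the `newest_first` function below are hypothetical names introduced only for this illustration.

```python
from datetime import datetime
from typing import NamedTuple


class Sample(NamedTuple):
    name: str          # hypothetical identifier of the sample document
    created: datetime  # creation date and time held by the device


def newest_first(samples, count):
    """Return up to `count` samples, most recently created first."""
    return sorted(samples, key=lambda s: s.created, reverse=True)[:count]
```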

In Embodiment 1, it has also been explained that the sample documents and the documents to be classified are e-mails, and that the sample selection rule is a rule that picks sample e-mails in order from the most recent transmission date and time.

In Embodiment 2, it has also been explained that the document classification device holds information on the document file name of each input sample document, and that the sample selection rule is a rule that selects sample documents so that, within each classification category, the number of times sample documents with the same file name are selected is minimized and the variance of the number of samples per file name is minimized.

In Embodiment 1, it has also been explained that the sample documents and the documents to be classified are e-mails, and that the sample selection rule is a rule that selects sample e-mails so that, within each classification category, the number of times sample e-mails with the same subject are selected is minimized and the variance of the number of sample e-mails per subject is minimized.

In Embodiment 2, it has also been explained that the document classification device holds, for each input sample document, application information indicating the type of application that created it, and that the sample selection rule is a rule that selects sample documents so that, within each classification category, the number of distinct application types among the selected sample documents is maximized and the variance of the number of samples per application type is minimized.

In Embodiment 1, it has also been explained that the sample documents and the documents to be classified are e-mails, and that the sample selection rule is a rule that selects sample e-mails so that, within each classification category, the number of distinct application types among the attachments of the selected sample e-mails is maximized and the variance of the number of sample e-mails per attached application type is minimized.

In Embodiment 2, it has also been explained that the document classification device holds information indicating the creator of each input sample document, and that the sample selection rule is a rule that selects sample documents so that, within each classification category, the number of distinct creators among the selected sample documents is maximized and the variance of the number of samples per creator is minimized.

In Embodiment 1, it has also been explained that the sample documents and the documents to be classified are e-mails, and that the sample selection rule is a rule that selects sample e-mails so that, within each classification category and with respect to the mail address written in a specific header field of the e-mail, the number of times sample e-mails with the same domain are selected is minimized and the variance of the number of sample e-mails per domain is minimized.
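
To apply a domain-based rule, the domain first has to be derived from the chosen header field. The sketch below uses Python's standard `email.utils.parseaddr`; the helper names `sender_domain` and `domain_counts` are hypothetical, and which header field to read (for example From) is left to the caller as an assumption.

```python
from collections import Counter
from email.utils import parseaddr


def sender_domain(header_value: str) -> str:
    """Return the lower-cased domain part of an address header value."""
    _, address = parseaddr(header_value)  # ("Alice", "alice@example.com")
    return address.rpartition("@")[2].lower()


def domain_counts(header_values):
    """Count how many sample e-mails fall under each domain."""
    return Counter(sender_domain(v) for v in header_values)
```

The resulting counts are what a selection rule would try to keep as even as possible across domains.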

In Embodiment 1, it has also been explained that the sample documents and the documents to be classified are e-mails, and that the sample selection rule is a rule that selects sample e-mails so that, within each classification category, the number of times sample e-mails written in the same character code are selected is minimized and the variance of the number of sample e-mails per character code is minimized.

In Embodiments 1 and 2, the sample selection rule has also been described as a rule that excludes from selection any document whose size is at or below a predetermined value.

Finally, an example hardware configuration of the mail classification device 100 and the document classification device 100b described in Embodiments 1 and 2 is described.
FIG. 8 shows an example of the hardware resources of the mail classification device 100 and the document classification device 100b described in Embodiments 1 and 2.
The configuration in FIG. 8 is only one example of the hardware configuration of the mail classification device 100 and the document classification device 100b; their hardware configuration is not limited to that shown in FIG. 8 and may be a different configuration.

In FIG. 8, the mail classification device 100 and the document classification device 100b include a CPU 911 (Central Processing Unit; also called a central processing device, processing device, arithmetic device, microprocessor, microcomputer, or processor) that executes programs.
The CPU 911 is connected via a bus 912 to, for example, a ROM (Read Only Memory) 913, a RAM (Random Access Memory) 914, a communication board 915, a display device 901, a keyboard 902, a mouse 903, and a magnetic disk device 920, and controls these hardware devices.
The CPU 911 may further be connected to an FDD 904 (Flexible Disk Drive), a compact disk device 905 (CDD), a printer device 906, and a scanner device 907. A storage device such as an optical disk device or a memory card (registered trademark) read/write device may be used instead of the magnetic disk device 920.
The RAM 914 is an example of a volatile memory. The storage media of the ROM 913, the FDD 904, the CDD 905, and the magnetic disk device 920 are examples of nonvolatile memories. These are examples of storage devices.
The "classification result storage database 130", the "classification rule storage unit 170", and the "condition rule storage unit 195" described in Embodiments 1 and 2 are implemented by the RAM 914, the magnetic disk device 920, and the like.
The communication board 915, the keyboard 902, the mouse 903, the scanner device 907, the FDD 904, and the like are examples of input devices.
The communication board 915, the display device 901, the printer device 906, and the like are examples of output devices.

As shown in FIG. 1, the communication board 915 is connected to, for example, a mail server. The communication board 915 may also be connected to, for example, a LAN (local area network), the Internet, a WAN (wide area network), or a SAN (storage area network).

The magnetic disk device 920 stores an operating system 921 (OS), a window system 922, a program group 923, and a file group 924.
The CPU 911 executes the programs in the program group 923 while using the operating system 921 and the window system 922.

The RAM 914 temporarily stores at least part of the operating system 921 program and the application programs to be executed by the CPU 911.
The RAM 914 also stores various data needed for processing by the CPU 911.

The ROM 913 stores a BIOS (Basic Input Output System) program, and the magnetic disk device 920 stores a boot program.
When the mail classification device 100 and the document classification device 100b start up, the BIOS program in the ROM 913 and the boot program on the magnetic disk device 920 are executed, and together they start the operating system 921.

The program group 923 stores programs that execute the functions described as a "~ unit" in the description of Embodiments 1 and 2 (excluding the "classification rule storage unit 170" and the "condition rule storage unit 195"; the same applies below). The programs are read and executed by the CPU 911.

The file group 924 stores, as items of a "~ file" or a "~ database", the information, data, signal values, variable values, and parameters that represent the results of the processes described in Embodiments 1 and 2 as "selection of ~", "choosing of ~", "extraction of ~", "judgment of ~", "determination of ~", "learning of ~", "comparison of ~", "generation of ~", "update of ~", "setting of ~", "registration of ~", and so on.
The "~ file" and "~ database" items are stored on a recording medium such as a disk or memory. The information, data, signal values, variable values, and parameters stored on such a medium are read into main memory or cache memory by the CPU 911 via a read/write circuit and used in CPU operations such as extraction, search, reference, comparison, arithmetic, calculation, processing, editing, output, printing, and display.
During these CPU operations, the information, data, signal values, variable values, and parameters are temporarily stored in main memory, registers, cache memory, buffer memory, and the like.
The arrows in the flowcharts described in Embodiments 1 and 2 mainly indicate the input and output of data and signals. The data and signal values are recorded on recording media such as the memory of the RAM 914, the flexible disk of the FDD 904, the compact disk of the CDD 905, the magnetic disk of the magnetic disk device 920, and other optical disks, mini disks, and DVDs. Data and signals are also transmitted online via the bus 912, signal lines, cables, and other transmission media.

What is described as a "~ unit" in Embodiments 1 and 2 may be a "~ circuit", a "~ device", or "~ equipment", and may also be a "~ step", a "~ procedure", or a "~ process".
That is, the data processing method according to the present invention can be realized by the steps, procedures, and processes shown in the flowcharts described in Embodiments 1 and 2.
What is described as a "~ unit" may also be implemented as firmware stored in the ROM 913. Alternatively, it may be implemented in software only; in hardware only, such as elements, devices, boards, and wiring; in a combination of software and hardware; or in a further combination with firmware. Firmware and software are stored as programs on recording media such as magnetic disks, flexible disks, optical disks, compact disks, mini disks, and DVDs. The programs are read and executed by the CPU 911. That is, a program causes a computer to function as the "~ unit" of Embodiments 1 and 2, or causes a computer to execute the procedure or method of the "~ unit" of Embodiments 1 and 2.

As described above, the mail classification device 100 and the document classification device 100b of Embodiments 1 and 2 are each a computer equipped with a CPU as a processing device; memory, a magnetic disk, and the like as storage devices; a keyboard, a mouse, a communication board, and the like as input devices; and a display device, a communication board, and the like as output devices. As noted above, the functions described as a "~ unit" are implemented using these processing, storage, input, and output devices.

100 mail classification device, 100b document classification device, 110 classification target data extraction unit, 120 classification unit, 130 classification result storage database, 140 learning sample data extraction unit, 150 learning sample selection unit, 160 learning unit, 170 classification rule storage unit, 180 learning sample extraction condition, 190 learning sample selection rule, 191 sample total count upper limit determination rule, 192 sample count determination rule, 193 sample selection rule, 195 condition rule storage unit, 200 mail archive device, 200b document archive device, 201 newly input mail, 201b newly input document, 210 mail storage database, 210b document storage database, 301 organization, 302 organization, 303 organization, 311 mail server, 312 mail server, 313 mail server, 321 user terminal, 322 user terminal, 323 user terminal, 330 network.

Claims

A data processing apparatus comprising:
a classification unit that classifies data into one of a plurality of categories according to a classification rule;
a learning unit that performs learning using sample data and newly generates the classification rule used by the classification unit;
a sample data extraction unit that extracts, for each category, sample data to be used for learning by the learning unit;
a sample data selection criteria information storage unit that stores sample data selection criteria information indicating an upper limit of the total number of selected sample data and selection criteria for sample data; and
a sample data selection unit that, based on the sample data selection criteria information, calculates for each category a selection count such that the total number of selected sample data is maximized within the upper limit and the difference between categories is minimized, and that selects, for each category and according to the per-category selection count, the sample data to be used for learning by the learning unit from the sample data extracted by the sample data extraction unit.

The data processing apparatus according to claim 1, wherein
the sample data selection criteria information storage unit stores sample data selection criteria information indicating, as the selection criteria, that the total number of selected sample data is to match the upper limit and that the difference in selection count between categories is to be minimized, and
the sample data selection unit calculates, for each category and based on the sample data selection criteria information, a selection count such that the total number of selected sample data matches the upper limit and the difference between categories is minimized.

The data processing apparatus according to claim 2, wherein,
where a is the number of sample data of category A, b is the number of sample data of category B, c is the upper limit of the total number of selected sample data, and c ≤ a + b,
the sample data selection unit calculates the selection count a′ of category A and the selection count b′ of category B as follows:
1) when a < c/2,
a′ = a
b′ = c − a
2) when b < c/2,
a′ = c − b
b′ = b
3) in cases other than 1) and 2),
a′ = c/2
b′ = c/2
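
The two-category case analysis of this claim can be sketched as follows. The sketch assumes that in case 2) the smaller category B keeps all b of its samples and category A receives c − b, since B cannot supply more than b samples. The claim does not say how an odd c is halved; giving the extra sample to category B is an assumption, as is the function name `select_two_matching_total`.

```python
def select_two_matching_total(a, b, c):
    """Return (a', b') with a' + b' == c and |a' - b'| as small as possible.

    a, b: available sample counts of categories A and B.
    c: upper limit of the total number of selected samples (c <= a + b).
    """
    if c > a + b:
        raise ValueError("requires c <= a + b")
    if 2 * a < c:            # case 1: A is too small to supply half
        return a, c - a
    if 2 * b < c:            # case 2: B is too small to supply half
        return c - b, b
    half = c // 2            # case 3: both can supply half
    return half, c - half    # assumed handling of odd c
```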

The data processing apparatus according to claim 1, wherein
the sample data selection criteria information storage unit stores sample data selection criteria information indicating, as the selection criteria, that the total number of selected sample data is to be maximized within the upper limit and that the selection count is to be the same in all categories, and
the sample data selection unit calculates, for each category and based on the sample data selection criteria information, a selection count such that the total number of selected sample data is maximized within the upper limit and the selection count is the same in all categories.

The data processing apparatus according to claim 4, wherein,
where a is the number of sample data of category A, b is the number of sample data of category B, c is the upper limit of the total number of selected sample data, and c ≤ a + b,
the sample data selection unit calculates the selection count a′ of category A and the selection count b′ of category B as follows:
1) when a < c/2,
a′ = a
b′ = a
2) when b < c/2,
a′ = b
b′ = b
3) in cases other than 1) and 2),
a′ = c/2
b′ = c/2
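
The equal-count case analysis of this claim can be sketched in the same way. Rounding c/2 down for an odd c is an assumption not stated in the claim, and the function name `select_two_equal` is hypothetical.

```python
def select_two_equal(a, b, c):
    """Return (a', b') with a' == b' and a' + b' as large as possible within c.

    a, b: available sample counts of categories A and B.
    c: upper limit of the total number of selected samples (c <= a + b).
    """
    if c > a + b:
        raise ValueError("requires c <= a + b")
    if 2 * a < c:        # case 1: A limits the common count
        return a, a
    if 2 * b < c:        # case 2: B limits the common count
        return b, b
    half = c // 2        # case 3: the upper limit is the binding constraint
    return half, half
```

Unlike the total-matching rule, this rule may leave part of the upper limit unused, for example 4 selected samples out of a limit of 8 when one category has only 2.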

The data processing apparatus according to claim 1, wherein,
when the total number of sample data extracted by the sample data extraction unit is less than the upper limit of the total number of selected sample data, the sample data selection unit calculates, for each category, a selection count such that the total number of selected sample data is maximized within the total number of extracted sample data and the difference between categories is minimized.

The data processing apparatus according to claim 1, wherein
the sample data extraction unit extracts, for one category, a plurality of sample data having a plurality of attributes, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct attributes among the selected sample data is maximized and the difference in the number of selected sample data between attributes is minimized.

The data processing apparatus according to claim 7, wherein the sample data selection unit,
for each category, selects from the plurality of sample data of that category one sample data per attribute for every attribute contained in the plurality of sample data,
when the number of sample data thus selected is greater than or equal to the selection count specified for the category, randomly selects from the selected sample data the sample data to be used for learning so that their number matches the selection count, and
when the number of sample data thus selected is less than the selection count specified for the category, adopts the selected sample data as sample data to be used for learning and makes up the shortfall by selecting, from the sample data not yet selected, one sample data per attribute for every attribute contained in those unselected sample data.
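
The round-by-round procedure of this claim can be sketched as below. The function name `select_diverse`, the `attr_of` callback, and the optional `rng` argument are hypothetical. Within an attribute the sketch simply takes the first unselected sample in input order, so a caller wanting a date-based order would sort the input beforehand; that ordering choice is an assumption.

```python
import random


def select_diverse(samples, attr_of, target, rng=None):
    """Pick `target` samples of one category, one per attribute per round.

    samples: extracted samples of a single category.
    attr_of: function returning the attribute of a sample (e.g. its domain).
    target: selection count specified for the category.
    """
    rng = rng or random.Random()
    remaining = list(range(len(samples)))  # indices not yet selected
    chosen = []
    while remaining and len(chosen) < target:
        # One candidate per attribute among the unselected samples.
        first_per_attr = {}
        for i in remaining:
            first_per_attr.setdefault(attr_of(samples[i]), i)
        round_pick = list(first_per_attr.values())
        need = target - len(chosen)
        if len(round_pick) > need:
            # More attributes than open slots: choose among them at random.
            round_pick = rng.sample(round_pick, need)
        chosen.extend(round_pick)
        taken = set(round_pick)
        remaining = [i for i in remaining if i not in taken]
    return [samples[i] for i in chosen]
```

Each pass adds at most one sample per attribute, so the per-attribute counts of the result differ by at most one unless an attribute runs out of samples.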

The data processing apparatus according to claim 8, wherein
the sample data extraction unit extracts a plurality of sample data in each of which date and time information is set, and,
when there are a plurality of sample data having the same attribute, the sample data selection unit selects the sample data in order of the date and time indicated in the date and time information.

The data processing apparatus according to claim 7, wherein
the sample data extraction unit extracts, for one category, a plurality of sample data bearing a plurality of types of titles, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct titles is maximized and the difference in the number of selected sample data between titles is minimized.

The data processing apparatus according to claim 7, wherein
the sample data extraction unit extracts, for one category, a plurality of sample data indicating a plurality of types of data creators, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct data creators is maximized and the difference in the number of selected sample data between data creators is minimized.

The data processing apparatus according to claim 7, wherein
the sample data extraction unit extracts, for one category and as the plurality of sample data, a plurality of e-mails bearing a plurality of mail addresses that belong to a plurality of types of domains, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct domains is maximized and the difference in the number of selected sample data between domains is minimized.

The data processing apparatus according to claim 7, wherein
the sample data extraction unit extracts, for one category, a plurality of sample data associated with a plurality of types of application programs, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct application programs is maximized and the difference in the number of selected sample data between application programs is minimized.

The data processing apparatus according to claim 7, wherein
the sample data extraction unit extracts, for one category, a plurality of sample data created in a plurality of types of character codes, and
the sample data selection unit selects, for each category and within the specified selection count, the sample data to be used for learning so that the number of distinct character codes is maximized and the difference in the number of selected sample data between character codes is minimized.

The data processing apparatus according to claim 1, wherein
the classification unit classifies data input in each period into one of the plurality of categories,
the sample data selection criteria information storage unit stores sample data selection criteria information in which the total number of selected sample data is defined as a value obtained by multiplying the number of data input in each period by a predetermined ratio, and
the sample data selection unit calculates, for each period and based on the sample data selection criteria information, the value obtained by multiplying the number of input data by the predetermined ratio as the total number of selected sample data, and selects the sample data to be used for learning by the learning unit using the calculated total number.
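
As a small illustration, the per-period total can be computed from the input volume of that period. Expressing the predetermined ratio as an integer percentage and rounding down are both assumptions made here to keep the arithmetic exact; the name `period_total` is hypothetical.

```python
def period_total(input_count: int, ratio_percent: int) -> int:
    """Total number of samples to select for one period.

    input_count: number of data items input during the period.
    ratio_percent: predetermined ratio, given as a whole percentage (assumed).
    """
    return input_count * ratio_percent // 100  # integer math, rounds down
```

With 2,000 items input in a period and a ratio of 5 percent, the total for that period is 100 samples.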

A data processing method wherein
a computer classifies data into one of a plurality of categories according to a classification rule,
the computer performs learning using sample data and newly generates the classification rule used for classifying the data,
the computer extracts, for each category, sample data to be used for the learning, and
the computer, based on sample data selection criteria information indicating an upper limit of the total number of selected sample data and selection criteria for sample data, calculates for each category a selection count such that the total number of selected sample data is maximized within the upper limit and the difference between categories is minimized, and selects, for each category and according to the per-category selection count, the sample data to be used for the learning from the extracted sample data.

A program that causes a computer to execute:
a classification process of classifying data into one of a plurality of categories according to a classification rule;
a learning process of performing learning using sample data and newly generating the classification rule used in the classification process;
a sample data extraction process of extracting, for each category, sample data to be used for learning in the learning process;
a read process of reading sample data selection criteria information indicating an upper limit of the total number of selected sample data and selection criteria for sample data; and
a sample data selection process of calculating, for each category and based on the sample data selection criteria information, a selection count such that the total number of selected sample data is maximized within the upper limit and the difference between categories is minimized, and of selecting, for each category and according to the per-category selection count, the sample data to be used for learning in the learning process from the sample data extracted by the sample data extraction process.