JP2005182218A

JP2005182218A - Dictionary editing device, and document classifying device and its program

Info

Publication number: JP2005182218A
Application number: JP2003418871A
Authority: JP
Inventors: Yoshihiro Ueda; 芳弘上田; Naotaka Kato; 直孝加藤; Katsuaki Hayashi; 克明林
Original assignee: Ishikawa Prefecture; Ishikawa Prefectural Government
Current assignee: Ishikawa Prefecture; Ishikawa Prefectural Government
Priority date: 2003-12-17
Filing date: 2003-12-17
Publication date: 2005-07-07
Anticipated expiration: 2023-12-17
Also published as: JP4496347B2

Abstract

PROBLEM TO BE SOLVED: To provide a document classification device that automatically and accurately routes a document to the appropriate person in charge.

SOLUTION: The document classification device (mail analysis device) 2 has a dictionary 4 consisting of an importance dictionary 41, which stores a tf·idf value for each word that appears in a category, and a co-occurrence dictionary 42, which stores an idf/conf value for each combination of a first word and a second word among the words that appear in the category. The document classification device 2 looks up the words appearing in an input document in the dictionary 4 to obtain the tf·idf value and the idf/conf value of each dictionary word, performs a predetermined operation on these values to calculate a score for each word, calculates a score for each category from the word scores, and classifies the input document into one of the plurality of categories on that basis.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a dictionary editing device, a document classification device, and a program therefor, and in particular to a dictionary editing device, a document classification device, and a program for automatically and accurately routing electronic documents received by e-mail, the Web, or similar channels to the appropriate person in charge.

Recently, corporate call centers and similar operations face a growing need to handle customer inquiries by e-mail or the Web rather than by telephone or fax. To handle e-mail inquiries (inquiry mail) with a small staff, however, each inquiry often has to be routed to the appropriate person in charge, who then replies. Making this routing work more efficient is essential to providing better service by e-mail or the Web, and a system that performs the classification automatically is strongly desired.

As a means of automatically classifying e-mail, a device is known that learns analogical inference rules for classifying e-mail from sample data and then classifies e-mail automatically on the basis of those rules (see, for example, Patent Document 1).

Meanwhile, many studies of document classification use the tf·idf value (term frequency times inverse document frequency), which expresses the importance of a word on its own. In addition, the present inventors have proposed classifying documents such as e-mail using, besides the tf·idf value, an idf/conf value that expresses the co-occurrence between two words (see Non-Patent Document 1).

Patent Document 1: JP 2001-256251 A
Non-Patent Document 1: Narita et al., "(F-67) Automatic e-mail distribution by text mining and reinforcement learning," FY2002 Joint Convention of the Hokuriku Branches of Electrical Societies

As noted above, the present inventors previously proposed classifying documents such as e-mail using the tf·idf value and the idf/conf value. Their subsequent study showed, however, that simply using the tf·idf value and the idf/conf value does not give practical classification accuracy for tasks such as e-mail classification. Specifically, combining the tf·idf value and the idf/conf value into a single parameter and building and using a single dictionary does not give classification results fit for practical use, whereas treating them as two separate, independent parameters and building and using two dictionaries is effective for achieving practical classification. Furthermore, continuing to learn word weights after the dictionaries have been built is effective, and learning the two independent parameters, the tf·idf value and the idf/conf value, individually proved effective for achieving practical classification. It was also found effective to perform three kinds of learning in this process: important words, specific words, and unnecessary words.

An object of the present invention is to provide a dictionary editing device that creates dictionaries for automatically and accurately routing documents to the appropriate person in charge.

Another object of the present invention is to provide a document classification device that automatically and accurately routes documents to the appropriate person in charge.

A further object of the present invention is to provide a document classification program that automatically and accurately routes documents to the appropriate person in charge.

The dictionary editing device of the present invention creates dictionaries from documents whose category is known. For each word that appears in a category, it creates an importance dictionary that stores a tf·idf value expressing the importance of the word on its own; and for each combination of words among the words that appear in the category, it creates a co-occurrence dictionary that stores an idf/conf value expressing the co-occurrence between those words.

The document classification device of the present invention comprises a dictionary and a document classification device proper. The dictionary consists of a plurality of category dictionaries, one for each of a plurality of categories. Each category dictionary consists of an importance dictionary, which stores, for each word that appears in the category, a tf·idf value expressing the importance of the word on its own, and a co-occurrence dictionary, which stores, for each combination of a first word and a second word among the words that appear in the category, an idf/conf value expressing the co-occurrence between the two words. The document classification device looks up the words that appear in an input document in the dictionary to obtain the tf·idf value and the idf/conf value of each dictionary word, performs a predetermined operation on these values to calculate a score for each word, calculates a score for each category from the word scores, and on that basis classifies the input document into one of the plurality of categories.

The program of the present invention implements a document classification device. It uses a dictionary consisting of a plurality of category dictionaries, one for each of a plurality of categories, in which each category dictionary consists of an importance dictionary that stores, for each word that appears in the category, a tf·idf value expressing the importance of the word on its own, and a co-occurrence dictionary that stores, for each combination of a first word and a second word among the words that appear in the category, an idf/conf value expressing the co-occurrence between the two words. The program causes a computer to execute: a process of looking up the words that appear in an input document in this dictionary to obtain the tf·idf value and the idf/conf value of each dictionary word; a process of performing a predetermined operation on the tf·idf value and the idf/conf value of each word to calculate a score for each word; a process of calculating a score for each category from the word scores; and a process of classifying the input document into one of the plurality of categories on the basis of the category scores.

According to the dictionary editing device of the present invention, two dictionaries (a tf·idf dictionary and an idf/conf dictionary) can be created by treating the tf·idf value, which expresses the importance of a word on its own, and the idf/conf value, which expresses the co-occurrence between words, as two separate, independent parameters. Using them as the dictionaries of a document classification device gives practical classification accuracy when classifying documents such as e-mail. A dictionary for automatically and accurately routing documents to the appropriate person in charge can therefore be created easily.

According to the document classification device of the present invention, using the two dictionaries described above (the tf·idf dictionary and the idf/conf dictionary) allows an input document to be classified into a category essentially on the basis of the tf·idf value, which expresses the importance of a word on its own, and the idf/conf value, which expresses the co-occurrence between multiple (two) words. This gives practical classification accuracy when classifying documents such as e-mail, so documents can be routed automatically and accurately to the appropriate person in charge.

The document classification program of the present invention can be supplied by storing it on a medium such as a flexible disk, CD-ROM, CD-R/W, or DVD, or by download over a network such as the Internet. This makes it easy to realize the document classification system described above and enables accurate document classification.

FIG. 1 is a configuration diagram of the document classification system of the present invention. The system consists of three subsystems: a dictionary editing device 1, a document classification device 2, and a reinforcement learning device 3. These are interconnected by, for example, a LAN (local area network). In this example the document classification device 2 is a mail analysis device 2. Accordingly, in the example below, the input document to be classified (analyzed) is an e-mail 8, each category is a person in charge (respondent) to whom documents are to be distributed, and the document classification device 2 delivers (distributes) the input document to the person in charge to whom it was classified.

The dictionary editing device 1 creates dictionaries from the document files 6. For each word that appears for a given person in charge, it creates an importance dictionary (tf·idf dictionary) 41 that stores the tf·idf value expressing the importance of the word on its own; and for each combination of words among the words that appear for that person, it creates a co-occurrence dictionary (idf/conf dictionary) 42 that stores the idf/conf value expressing the co-occurrence between those words. The document file 6 holds documents that each person in charge has written in the course of daily work, so the category of each document, that is, its person in charge, is known. A word that appears "for a person in charge" is therefore a word that appears in the document file 6 written by that person. In this example the co-occurrence dictionary 42 stores, for each combination of a first word and a second word among the words that appear for the person, an idf/conf value expressing the co-occurrence between the two words. The dictionary 4 thus consists of a plurality of per-person dictionaries 40, one for each person in charge, and each per-person dictionary 40 consists of an importance dictionary 41 and a co-occurrence dictionary 42.

In practice, the dictionary editing device 1 reads text files obtained by converting the various documents stored in the document file 6 (a folder per person in charge), removes line breaks and spaces from the text, splits it into words by well-known morphological analysis, and registers the extracted words in the dictionary 40 of the corresponding person in charge. Extracted words whose part of speech is particle, auxiliary verb, conjunction, prefix, adverb, adnominal, interjection, or symbol are regarded as unnecessary words and are not registered in the dictionary 4.
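The preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not name a morphological analyzer, so the analyzer is passed in as a callable, and `toy_tokenizer`, the part-of-speech labels, and the tiny lexicon are hypothetical stand-ins.

```python
from typing import Callable, Iterable

# Parts of speech treated as unnecessary words (English labels are illustrative).
EXCLUDED_POS = {
    "particle", "auxiliary verb", "conjunction", "prefix",
    "adverb", "adnominal", "interjection", "symbol",
}

Token = tuple[str, str]  # (surface form, part of speech)


def extract_words(text: str, tokenize: Callable[[str], Iterable[Token]]) -> list[str]:
    """Strip whitespace, tokenize, and drop words with excluded parts of speech."""
    # str.split() with no argument splits on any whitespace (line breaks,
    # ASCII and full-width spaces, tabs), so joining the pieces removes it all.
    cleaned = "".join(text.split())
    return [surface for surface, pos in tokenize(cleaned) if pos not in EXCLUDED_POS]


def toy_tokenizer(text: str) -> list[Token]:
    """Hypothetical stand-in for a real analyzer: greedy lookup in a tiny lexicon."""
    lexicon = {"プリンタ": "noun", "が": "particle", "故障": "noun", "。": "symbol"}
    tokens, i = [], 0
    while i < len(text):
        for word, pos in lexicon.items():
            if text.startswith(word, i):
                tokens.append((word, pos))
                i += len(word)
                break
        else:
            i += 1  # skip characters the toy lexicon does not know
    return tokens


words = extract_words("プリンタ が\n故障。", toy_tokenizer)
print(words)
```

In a real system the `tokenize` argument would wrap whatever morphological analyzer is in use; only its (surface, part of speech) output matters to the filter.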

That is, in creating and editing the dictionary 4, the present invention uses the tf·idf value, the importance of a word on its own, and also the idf/conf value (idf divided by confidence), the co-occurrence between words. Here the tf value is the frequency with which a word appears in the documents (the importance of the word); the idf value indicates how far the word is confined to a particular class (the specificity of the word); the tf·idf value, which is used in document classification research, is a weight that combines importance and specificity (the importance of the word on its own); the conf value is a confidence indicating the co-occurrence of words; and the idf/conf value indicates the co-occurrence between highly specific words.
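To make these quantities concrete, here is a small Python sketch on a toy corpus. The exact formulas are not given in this passage, so the definitions below are assumptions: tf as a relative frequency within one person's documents, idf as a logarithm over persons in charge, and conf as an association-rule style confidence. Only the combinations "tf times idf" and "idf divided by conf" come from the text.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): person in charge -> documents, each a list of words.
corpus = {
    "person_a": [["printer", "toner", "printer"], ["printer", "paper"]],
    "person_b": [["network", "router"], ["network", "printer"]],
}


def tf(word: str, person: str) -> float:
    """Assumed tf: share of the person's word occurrences that are `word`."""
    counts = Counter(w for doc in corpus[person] for w in doc)
    return counts[word] / sum(counts.values())


def idf(word: str) -> float:
    """Assumed idf: log(persons / persons who use `word`) + 1.

    Raises ZeroDivisionError for a word nobody uses.
    """
    using = sum(any(word in doc for doc in docs) for docs in corpus.values())
    return math.log(len(corpus) / using) + 1.0


def conf(first: str, second: str, person: str) -> float:
    """Assumed conf: among the person's documents containing `first`,
    the share that also contain `second`."""
    with_first = [doc for doc in corpus[person] if first in doc]
    return sum(second in doc for doc in with_first) / len(with_first)


tf_idf = tf("toner", "person_a") * idf("toner")                  # tf times idf
idf_conf = idf("toner") / conf("toner", "printer", "person_a")   # idf divided by conf
```

With these assumed definitions, a word used by every person in charge gets the minimum idf of 1, and a pair that never co-occurs would give conf = 0, a case a real implementation has to handle before dividing.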

FIG. 2(A) shows an example of the tf·idf dictionary (importance dictionary) 41. For each word a1, a2, a3, ... used by the person in charge, the importance dictionary 41 stores its tf value, idf value, and tf·idf value. The calculation of these values is described later. FIG. 2(B) shows an example of the idf/conf dictionary (co-occurrence dictionary) 42. For each combination of a first word and a second word among the words b1, b2, b3, ... used by the person in charge, the co-occurrence dictionary 42 stores the idf value of the first word, the conf value between the first and second words, and the idf/conf value between the first and second words. As noted above, the idf/conf value in this example indicates co-occurrence that takes the specificity of the two words into account. The calculation of the conf value and the idf/conf value is described later.
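One possible in-memory layout for the rows described for FIG. 2 is sketched below in Python. The class and field names are illustrative assumptions, not taken from the patent. Here tf·idf and idf/conf are derived on access rather than stored, which is a design choice of this sketch; the dictionaries in the text store them as separate values.

```python
from dataclasses import dataclass, field


@dataclass
class ImportanceEntry:
    """One row of the importance dictionary 41 for a single word."""
    tf: float
    idf: float

    @property
    def tf_idf(self) -> float:
        return self.tf * self.idf  # term frequency times inverse document frequency


@dataclass
class CooccurrenceEntry:
    """One row of the co-occurrence dictionary 42 for a (first, second) word pair."""
    idf: float   # idf value of the first word
    conf: float  # conf value between the first and second word

    @property
    def idf_conf(self) -> float:
        return self.idf / self.conf  # idf divided by confidence


@dataclass
class PersonDictionary:
    """Per-person dictionary 40: one importance and one co-occurrence dictionary."""
    importance: dict[str, ImportanceEntry] = field(default_factory=dict)
    cooccurrence: dict[tuple[str, str], CooccurrenceEntry] = field(default_factory=dict)


# Dictionary 4: person in charge -> that person's dictionary 40 (values are made up).
dictionary: dict[str, PersonDictionary] = {"person_a": PersonDictionary()}
dictionary["person_a"].importance["printer"] = ImportanceEntry(tf=0.02, idf=1.5)
dictionary["person_a"].cooccurrence[("printer", "toner")] = CooccurrenceEntry(idf=1.5, conf=0.5)
```

Keying the co-occurrence dictionary by an ordered pair keeps the first word and second word distinct, matching the description that the stored idf value belongs to the first word.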

The mail analysis device 2 looks up the words that appear in the input document, an inquiry e-mail (inquiry mail) 8, in the dictionary 4 to obtain the tf·idf value and the idf/conf value of each dictionary word, performs a predetermined operation on these values to calculate a score for each word, calculates a score for each person in charge from the word scores, and on that basis classifies the input document to one of the persons in charge. In this example the predetermined operation is, for each dictionary word, the product of its tf·idf value, its idf/conf value, and its match rate with the word appearing in the input document; this product is the word score. Based on the per-person scores, the mail analysis device 2 classifies the inquiry mail (hereinafter simply "mail") 8 to a predetermined number of persons in charge, taken from the top of the score ranking.
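The scoring step can be sketched in Python as follows. The product of the three factors is taken from the text. Summing the word scores into a per-person score is an assumption, since the text only says the person score is calculated "from" the word scores. All names and numbers are hypothetical.

```python
def word_score(tf_idf: float, idf_conf: float, match_rate: float) -> float:
    """Score of one matched word: the product named in the text."""
    return tf_idf * idf_conf * match_rate


def person_score(matches: dict[str, tuple[float, float, float]]) -> float:
    """Assumed aggregation: the sum of the scores of all matched words."""
    return sum(word_score(*values) for values in matches.values())


def top_respondents(scores: dict[str, float], n: int) -> list[str]:
    """The n persons in charge with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Per person in charge: matched word -> (tf·idf, idf/conf, match rate).
matches_by_person = {
    "person_a": {"printer": (0.30, 2.0, 1.0), "toner": (0.10, 1.5, 0.5)},
    "person_b": {"printer": (0.05, 1.0, 1.0)},
    "person_c": {"network": (0.20, 2.0, 0.5)},
}
scores = {person: person_score(m) for person, m in matches_by_person.items()}
respondents = top_respondents(scores, n=2)
print(respondents)
```

The parameter `n` plays the role of the "predetermined number" of persons in charge to whom the mail is distributed.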

Like the dictionary editing device 1, the mail analysis device 2 removes line breaks and spaces from the body of the input mail 8, splits it into words by well-known morphological analysis, and looks up the resulting words in the dictionary 4. From the tf·idf values, idf/conf values, and match rates of the matched words or word combinations, it calculates the score of every person in charge for the mail 8, and finally estimates the one or more persons in charge with the highest scores as respondents (the persons who should reply to the mail 8). As shown in FIG. 3, the result is displayed on, for example, the mail analysis device 2. In other words, the device takes a mail 8 from an inquirer as input and outputs the candidate persons in charge who should reply (FIG. 3). In this example the mail analysis device 2 also distributes (delivers) the input mail 8, according to the classification result (scores), to the respondent terminals 5 of the persons in charge marked "〇" in FIG. 3.

FIG. 3 shows an example of a mail analysis result. For each person in charge d, the result consists of whether d is selected as a respondent, the score Score(d), the words t matched between the mail 8 and the dictionary, and the score (d, t) of each such word. Persons in charge d and matched words t are listed in descending order of score. The matched word with the highest score (rank 1; in general, rank i) is called the rank-1 (rank-i) matched word. In FIG. 3, the persons in charge marked "〇" are selected as respondents; in this example the top two are estimated as respondents. The input (received) mail 8 is classified (sent) to the persons in charge selected as respondents. Persons in charge marked "×" are not estimated as respondents. When a human classifier is involved, or when a person in charge who has received a mail reroutes it to another person in charge, the classification can be made with reference to the estimation result shown in FIG. 3.

Although not shown, the mail analysis device 2 is connected to the respondent terminals 5 via, for example, a LAN, and is also connected to a mail server. The mail server is connected to the Internet, receives mail 8 from outside, and passes it to the mail analysis device 2 for delivery to the internal terminals 5. A mail 8 is created (entered) when, for example, an inquirer uses a well-known Web browser to enter the inquiry text, name, reply e-mail address, and so on.

The present invention does not use the two weights obtained from the two dictionaries 41 and 42 as they are; it also takes the match rate of the matched words into account when calculating scores. That is, words appearing in the mail 8 being processed are compared with dictionary words by partial match rather than exact match, and the degree of partial match (the word match rate) is used in the score calculation. This eliminates, for example, the effect of slight individual differences in how the same term is used.
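How the match rate is computed is not specified in this passage. The Python sketch below shows one plausible definition, purely as an assumption: when one word contains the other, the rate is the length of the shorter word divided by the length of the longer one, and otherwise it is zero. The function name `match_rate` is hypothetical.

```python
def match_rate(mail_word: str, dict_word: str) -> float:
    """Assumed partial-match rate between a mail word and a dictionary word.

    Returns shorter length / longer length when one word contains the other,
    so an exact match gives 1.0; unrelated words give 0.0.
    """
    if not mail_word or not dict_word:
        return 0.0
    shorter, longer = sorted((mail_word, dict_word), key=len)
    return len(shorter) / len(longer) if shorter in longer else 0.0


# Spelling variants of the same term still match, with a rate below 1.
print(match_rate("プリンター", "プリンタ"))
```

Under this definition a spelling variant such as a trailing long-vowel mark lowers the score slightly instead of causing the word to be missed altogether, which is the kind of individual difference the text refers to.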

Based on the result of having the mail analysis device 2 classify predetermined documents (that is, the separate document file 7) to one of the persons in charge, the reinforcement learning device 3 updates (reinforces) the tf·idf values in the previously created importance dictionary 41 and the idf/conf values in the co-occurrence dictionary 42. To do so, the reinforcement learning device 3 performs important-word learning, specific-word learning, and unnecessary-word learning. In important-word learning, when a predetermined document is classified to its true person in charge, the words that appear in the document and were matched in the dictionary 40 of the true person in charge are treated as important words, and their weights are increased. In specific-word learning, when a predetermined document is not classified to its true person in charge, the words appearing in the document that are highly specific to the true person in charge are re-evaluated, which increases the weights of the true person's specific words. In unnecessary-word learning, when a predetermined document is classified to a person in charge other than the true one, the words that appear in the document and were matched in the dictionary 40 of the wrongly selected person in charge (a person in charge, that is, a category, other than the true one) are treated as unnecessary words; their weights in the dictionary 40 of the wrongly selected person in charge are decreased, and their weights in the dictionaries 40 of persons in charge other than the true one are decreased. This further improves the classification accuracy for the mail 8.
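The three kinds of learning can be sketched in Python as follows. This is a simplified illustration, not the patent's update rule: the multiplicative step, the single weight per word, and the test for "highly specific" (a word that only the true person's dictionary contains) are all assumptions. In the system described, the tf·idf and idf/conf values would each be updated separately.

```python
def reinforce(weights, doc_words, true_person, predicted_person, step=0.1):
    """Apply important-, specific-, or unnecessary-word learning in place.

    weights: person in charge -> {word: weight} (hypothetical layout).
    """
    matched_true = [w for w in doc_words if w in weights[true_person]]
    if predicted_person == true_person:
        # Important-word learning: strengthen the matched words of the true person.
        for w in matched_true:
            weights[true_person][w] *= 1 + step
        return
    # Specific-word learning: strengthen matched words that no other person uses
    # (an assumed stand-in for "highly specific to the true person").
    for w in matched_true:
        if all(w not in d for p, d in weights.items() if p != true_person):
            weights[true_person][w] *= 1 + step
    # Unnecessary-word learning: weaken the words matched in the dictionary
    # of the wrongly selected person.
    for w in doc_words:
        if w in weights[predicted_person]:
            weights[predicted_person][w] *= 1 - step


weights = {
    "person_a": {"printer": 1.0, "toner": 1.0},
    "person_b": {"printer": 1.0, "network": 1.0},
}
# A document by person_a that was wrongly classified to person_b.
reinforce(weights, ["printer", "toner"], true_person="person_a", predicted_person="person_b")
print(weights)
```

In this run "toner" is strengthened for person_a because only person_a uses it, while "printer" is weakened for person_b, who was wrongly selected on the strength of it.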

To reinforce the tf·idf values and idf/conf values in the dictionary 4 once it has been created, the present invention uses a separate document file 7, that is, a file into which documents written by the persons in charge themselves in the course of their various duties have been read. The separate document file 7 holds documents that are distinct from the document file 6 and the mail 8 and were written by each person in charge; documents that the persons in charge write from time to time are taken in and stored as they arise. In the same way as the document file 6, the separate document file 7 is classified to one of the persons in charge by the mail analysis device 2. Because the person in charge of each document in the separate document file 7 is also known, the result can be evaluated accurately; and because the tf·idf and idf/conf values can be reinforced at any time as the duties of the persons in charge change, this is effective for classifying the mail 8 accurately.

In the present invention, important-word learning, specific-word learning, and unnecessary-word learning, that is, the reinforcement of the tf·idf and idf/conf values, are performed by profit sharing (hereinafter "PS"). Specifically, the person in charge (classification destination) of each document in the separate document file 7 is estimated, and the tf·idf and idf/conf values are corrected by a reward (profit) that depends on how appropriate the estimate was. PS has attracted attention in reinforcement learning. This makes it possible to obtain classification accuracy on a par with that of a classification expert. As is well known, PS reinforces multiple rules (in the present invention, the weights of multiple words) together when a reward is obtained, so the classification accuracy for the mail 8 can be improved efficiently. No system that incorporates reinforcement learning into document classification in this way has been developed to date.

As described above, according to the present invention, practical classification can be achieved by classifying documents such as the mail 8 using the two weights obtained from the two dictionaries 41 and 42. The classification results (classification accuracy) of an expert who routes mail 8 as a daily duty can be regarded as the accuracy needed in practice. The mail analysis device 2 of the present invention achieves classification accuracy roughly equal to that of such an expert, so classification by the mail analysis device 2 is fully fit for practical use. According to the inventors' study, classification using the two dictionaries 41 and 42 alone is in fact slightly less accurate than the expert, but reinforcement learning of the tf·idf and idf/conf values using the separate document file 7 makes it possible to classify the mail 8 with accuracy equal to that of a classification expert.

The processing in the document classification system of the present invention is described in detail below with reference to the processing flows. FIG. 4 is the document classification processing flow and shows the document classification processing in the document classification system of FIG. 1. The dictionary editing device 1 collects the document files 6 written by each person in charge, calculates the weights of the words that appear in them as tf·idf values, which express the importance of a word on its own, and idf/conf values, which express the co-occurrence between two words, and creates these two kinds of dictionaries 41 and 42 for each person in charge (step S101). The reinforcement learning device 3 applies PS to perform reinforcement learning of these weights (the tf·idf and idf/conf values) (step S102). Specifically, the mail analysis device 2 checks whether a mail 8 has been received; if none has, it compares the documents of the separate document file 7 against the two kinds of dictionaries 41 and 42, calculates a score for each person in charge from the word weights and word match rates, estimates the persons in charge with high scores as respondents, and feeds the estimation result to the reinforcement learning device 3 to update the weights of the tf·idf and idf/conf values. When a mail 8 is received, on the other hand, the mail analysis device 2 analyzes it (step S103): it compares the mail 8 against the two kinds of dictionaries 41 and 42 and calculates a score for each person in charge from the word weights and word match rates. The mail analysis device 2 then estimates the persons in charge with high scores as respondents to the mail 8 and displays them on a terminal (not shown) of the mail analysis device 2 (step S104).

FIG. 5 is the dictionary editing processing flow and shows the dictionary editing processing executed by the dictionary editing device 1 of FIG. 1 in step S101. The dictionary editing device 1 reads all document files 6 of all persons in charge (step S111) and preprocesses all documents (step S112). That is, it removes line breaks and spaces from all documents, applies well-known morphological analysis to the remaining text to obtain segmented words, and deletes words with unnecessary parts of speech from the words obtained. Next, the dictionary editing device 1 generates the person-in-charge/word table shown in FIG. 2(C) (step S113), calculates tf·idf values from it (step S114), and writes each word and its tf·idf value into the tf·idf dictionary (importance dictionary) 41 (step S115). This creates the tf·idf dictionary 41. The person-in-charge/word table and its generation are described later with reference to FIG. 6. Next, the dictionary editing device 1 generates the word combination table shown in FIG. 8 (step S116), calculates idf/conf values from it (step S117), and writes each word, its combination, and the idf/conf value into the idf/conf dictionary (co-occurrence dictionary) 42 (step S118). This creates the idf/conf dictionary 42. The word combination table and its generation are described later with reference to FIG. 7.

FIG. 6 is the person-in-charge/word table generation processing flow and shows the processing executed by the dictionary editing device 1 in step S113. The dictionary editing device 1 sets d to the first person in charge (step S121), adds d as a column of the person-in-charge/word table of FIG. 2(C) and initializes all of its elements tf(t, d) to 0 (step S122), sets t to the first word (step S123), and checks whether t already exists among the rows t1, t2, t3, ... of the table (step S124). If it does not exist, the dictionary editing device 1 adds t as a row of the table and initializes all of its elements tf(t, d) to 0 (step S125). If it exists, step S125 is skipped. In either case, the dictionary editing device 1 then adds 1 to the element corresponding to the person in charge d and the word t (step S126), that is, it executes tf(t, d) = tf(t, d) + 1. After this, the dictionary editing device 1 checks whether all words of the person in charge d have been processed (step S127); if not, it sets t to the next word (step S128) and repeats step S124 and the subsequent steps. If they have, it checks whether all persons in charge have been processed (step S129); if not, it sets d to the next person in charge (step S1210) and repeats step S122 and the subsequent steps. When all persons in charge have been processed, the processing ends. This generates the person-in-charge/word table.
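The loop of FIG. 6 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `build_person_word_table`, the nested-dictionary layout, and the sample words are assumptions introduced here, and the input is assumed to be already segmented by morphological analysis.

```python
def build_person_word_table(words_by_person):
    """Return {word: {person: tf(t, d)}} from already-segmented words.

    words_by_person maps each person in charge d to the list of words
    taken from all of that person's documents (hypothetical input format).
    """
    persons = list(words_by_person)
    table = {}
    for d in persons:                      # S121, S129, S1210: loop over persons
        for t in words_by_person[d]:       # S123, S127, S128: loop over words
            if t not in table:             # S124: is t already a row?
                table[t] = {p: 0 for p in persons}   # S125: new row, all zeros
            table[t][d] += 1               # S126: tf(t, d) = tf(t, d) + 1
    return table


# Hypothetical sample data, for illustration only.
words_by_person = {"d1": ["pump", "valve", "pump"], "d2": ["valve", "resin"]}
table = build_person_word_table(words_by_person)
```

Unlike step S122, the sketch creates all person columns when a row is first added, which gives the same table without a separate column-initialization pass.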

The tf·idf value is defined by equation (1):

Here, tf(t, d) in the first term on the right-hand side is the number of occurrences of the word t in all documents (document files 6) created by the person in charge d. The first term therefore gives a large weight to words that the same person in charge uses repeatedly. For example, in FIG. 2(C), the number of occurrences of the word t1 for the person in charge d1 is tf(t1, d1) = 1. The denominator of the first term is the sum of the frequencies of all words used by the person in charge d. For example, in FIG. 2(C), this value for the person in charge d1 is Σtf(t', d1) = 8538. In the second term on the right-hand side, N is the total number of persons in charge who are classification destinations, and df(t) is the number of persons in charge who used the word t. The second term therefore gives a large weight to words used by a small number of particular persons in charge, and serves as an index of specificity that characterizes each person in charge. From the above, the tf value (tf_t^d), the idf value (idf_t^d), and the tf·idf value can be calculated from the person-in-charge/word table.
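The calculation described above can be sketched as follows. The first term, tf(t, d) divided by the total word frequency of d, follows the text directly. The exact form of the second term is not recoverable from this text, so the sketch assumes the common textbook form log(N / df(t)) + 1; that choice, the function name `tf_idf`, and the sample table are assumptions introduced here.

```python
import math


def tf_idf(table, persons):
    """Return {person: {word: weight}} from a {word: {person: count}} table."""
    n = len(persons)                                   # N: number of persons in charge
    totals = {d: sum(row[d] for row in table.values()) for d in persons}
    weights = {d: {} for d in persons}
    for t, row in table.items():
        df = sum(1 for d in persons if row[d] > 0)     # df(t): persons who used t
        idf = math.log(n / df) + 1.0                   # ASSUMED idf form, not from the source
        for d in persons:
            if row[d] > 0:
                weights[d][t] = row[d] / totals[d] * idf   # tf term times idf term
    return weights


# Hypothetical person-in-charge/word table, for illustration only.
table = {
    "pump": {"d1": 2, "d2": 0},
    "valve": {"d1": 1, "d2": 1},
    "resin": {"d1": 0, "d2": 1},
}
weights = tf_idf(table, ["d1", "d2"])
```

In this sample, "pump" is used only by d1 and so receives a larger weight for d1 than "valve", which both persons use.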

FIG. 7 is the word combination table generation processing flow and shows the processing executed by the dictionary editing device 1 in step S116. The dictionary editing device 1 reads the minimum support and the minimum confidence (step S131), sets d to the first person in charge (step S132), searches for sentence-ending periods to generate the sentence set S^d (step S133), and calculates LSC from the minimum support and the total number of sentences in S^d (step S134). Next, it sets the number of words per combination to 1, that is, sets the word combination table C (the same applies below) to C = C1 (step S135), sets the word combinations of C to the distinct words of each sentence extracted from S^d (step S136), and sets the count value (Count) of each combination in C to the number of sentences in S^d in which that combination appears (step S137). Then, for every word combination in C, it checks whether the count value is at least LSC; combinations that fall below LSC are removed from C and the others are kept (step S138). After this, it checks whether the number of words per combination is 2 (step S139). If it is not 2, that is, if it is 1, the dictionary editing device 1 sets it to 2, that is, C = C2 (step S1310), sets the word combinations of C to the two-word combinations generated from C1 (step S1311), and repeats step S137 and the subsequent steps.

If the number of words per combination is 2 in step S139, the dictionary editing device 1 takes t1 as the first word of a word combination in C2 (step S1312), sets the count value of t1 (t1_Count) to the count value (Count) of t1 in C1 (step S1313), and calculates LCC from the minimum confidence and the C1 count value of t1 (step S1314). Next, for every word combination in C2, it checks whether the C2 count value is at least LCC; combinations that fall below LCC are removed from C2 and the others are kept (step S1315). It then checks whether all persons in charge have been processed (step S1316); if not, it sets d to the next person in charge (step S1317) and repeats step S133 and the subsequent steps. When all persons in charge have been processed, the processing ends.

The idf/conf value is based on the well-known confidence measure used in association rule extraction. To apply the association rule technique to the present invention, let T^d = (t1, t2, t3, ..., tm) be the set of words used by the person in charge d, and let S^d = (s1, s2, s3, ..., sn) (si ⊆ T^d) be the set of sentences (each delimited by a period) extracted from the documents created by d. The support of a word t, support(t), is the proportion of sentences in S^d that contain t. An association rule is written t ⇒ t' and expresses that a sentence containing the word t is likely to contain the word t' as well, that is, that t and t' co-occur strongly. An association rule has two parameters, support and confidence, whose values indicate the significance of the rule. Here, support(t ⇒ t') is defined as the proportion of sentences in S^d that contain both t and t', and confidence(t ⇒ t') as the proportion of sentences containing t that also contain t'.

In the present invention, as shown in FIG. 8, a minimum support and a minimum confidence are set to find word combinations with high co-occurrence, and the confidence is used as the weight representing this co-occurrence. In this example, the minimum support and the minimum confidence are set to 0.3 and 0.7, respectively; these values can be determined empirically. As is well known from research on index term extraction, low-frequency words are unnecessary words, but the highest-frequency words are also often general words rather than characteristic words, and are therefore unnecessary as well. Likewise here, word combinations that do not satisfy the minimum support and the minimum confidence are unnecessary and are not registered in the dictionary 4. At the same time, combinations with the highest support or confidence tend to be combinations of general words and are also unnecessary. Therefore, as the concrete weight representing the co-occurrence of a word combination that satisfies the minimum support and the minimum confidence, the reciprocal of the confidence multiplied by the idf value of the first word, which accounts for the specificity of the word, is used. The idf value of the second word may additionally be multiplied in, and the support may be used in place of the confidence.

For example, suppose the sentence set S^d of the person in charge d is obtained as shown in FIG. 8. With one word per combination, the occurrence count (Count) of each word is obtained (first C1). From this, the word {word 4}, whose Count is smaller than (minimum support) × (number of sentences in S^d) = 0.3 × 4, is removed (second C1). This fixes, for each word, the count that satisfies the minimum support. Next, for every two-word combination of the remaining words, the occurrence count is obtained (first C2), and the combinations {word 1, word 2} and {word 1, word 5}, whose Count is smaller than 0.3 × 4, are removed (second C2). Then, among the remaining combinations, {word 2, word 3} and {word 3, word 5}, whose Count is smaller than (minimum confidence) × (count of the first word that satisfied the minimum support), are removed (third C2). Specifically, for {word 2, word 3}, 2 (the Count in C2; likewise below) < 0.7 × 3 (the Count of word 2 in the second C1; likewise below), and for {word 3, word 5}, 2 < 0.7 × 3, so both are removed. In contrast, for {word 1, word 3}, 2 > 0.7 × 2, and for {word 2, word 5}, 3 > 0.7 × 3, so both are kept.
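The two pruning passes of FIG. 7 can be sketched in Python as follows. The contents of FIG. 8 are not reproduced in this text, so the four sentences below are hypothetical data chosen only to be consistent with the counts quoted in the example above; the function name `frequent_pairs` is likewise an assumption. Following the example, the first word of a pair is taken to be the one that sorts first.

```python
from itertools import combinations


def frequent_pairs(sentences, min_support, min_confidence):
    """Return {(t1, t2): confidence} for pairs passing both thresholds."""
    lsc = min_support * len(sentences)            # S134: LSC
    c1 = {}
    for s in sentences:                           # S136, S137: count each word once per sentence
        for w in s:
            c1[w] = c1.get(w, 0) + 1
    c1 = {w: c for w, c in c1.items() if c >= lsc}    # S138 applied to C1
    result = {}
    for t1, t2 in combinations(sorted(c1), 2):    # S1311: two-word combinations from C1
        count = sum(1 for s in sentences if t1 in s and t2 in s)
        if count < lsc:                           # S138 applied to C2
            continue
        lcc = min_confidence * c1[t1]             # S1314: LCC from the count of the first word
        if count >= lcc:                          # S1315
            result[(t1, t2)] = count / c1[t1]     # confidence(t1 => t2)
    return result


# Hypothetical sentence set (sets of distinct words), not the actual FIG. 8 data.
sentences = [
    {"w1", "w2", "w3", "w5"},
    {"w2", "w3", "w5"},
    {"w2", "w4", "w5"},
    {"w1", "w3"},
]
pairs = frequent_pairs(sentences, min_support=0.3, min_confidence=0.7)
```

With this data, w4 is dropped in the first pass, and only the pairs (w1, w3) and (w2, w5) survive both thresholds, matching the outcome described in the example.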

According to the inventors' study, increasing the number of words per combination to three or more does not improve the classification accuracy for the mail 8 (document). In this example, therefore, combinations are limited to two words to reduce the amount of computation. Accordingly, the co-occurrence dictionary 42 stores an idf/conf value for each combination of a first word and a second word.

The idf/conf value, an index of how strongly the word t' co-occurs when the word t appears, is defined by equation (2).

Here, the numerator of the second term on the right-hand side is the maximum confidence value (max) for a given person in charge d; it normalizes the confidence values, whose magnitudes differ from person to person. From the above, the conf value and the idf/conf value can be calculated from the word combination table of FIG. 8. The definition of the conf value follows from equation (2) by removing idf_t^d from both sides.
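The equation image itself is not available in this text. Based only on the surrounding description (the reciprocal of the confidence, normalized by the maximum confidence for d and multiplied by the idf value of the first word), one plausible reading can be written as below. The notation, in particular the index set of the maximum, is an assumption made here and not the original typesetting.

```latex
\mathrm{conf}_{t,t'}^{d}
  = \frac{\max_{(u,u')} \mathrm{confidence}^{d}(u \Rightarrow u')}
         {\mathrm{confidence}^{d}(t \Rightarrow t')},
\qquad
\mathrm{idf/conf}_{t,t'}^{d}
  = \mathrm{idf}_{t}^{d} \times \mathrm{conf}_{t,t'}^{d}
```

Under this reading, a pair with the highest confidence for d receives conf = 1, and less certain pairs that still pass the minimum confidence receive proportionally larger values.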

FIG. 9 is the mail (inquiry mail) analysis processing flow and shows the mail analysis processing executed by the mail analysis device 2 of FIG. 1 in steps S103 and S104. The mail analysis device 2 reads the received mail (inquiry mail) 8 to be processed (step S141), removes its line breaks and spaces, and applies well-known morphological analysis to the remaining text (step S142) to obtain segmented words. Next, the mail analysis device 2 sets d to the first person in charge (step S143), sets t1 to the first word of the mail 8 (step S144), performs tf·idf-weighted addition (step S145), performs idf/conf-weighted addition (step S146), and checks whether all words of the mail 8 have been processed (step S147). The tf·idf-weighted addition is described later with reference to FIG. 10, and the idf/conf-weighted addition with reference to FIG. 11.

If not all words have been processed, the mail analysis device 2 sets t1 to the next word of the mail 8 (step S148) and repeats step S145 and the subsequent steps. If all words have been processed, the mail analysis device 2 calculates the score of the person in charge d as the sum of the scores of all words of the mail 8, that is, Score(d) = ΣScore(d, t1) (step S149), and checks whether all persons in charge have been processed (step S1410). If not, it sets d to the next person in charge (step S1411) and repeats step S144 and the subsequent steps. If all persons in charge have been processed, the mail analysis device 2 calculates the mean and standard deviation of Score(d) over all persons in charge, determines from these the candidate persons in charge (respondents) who should reply to the mail 8 (step S1412), and displays them (step S1413).

FIG. 10 is the tf·idf-weighted addition processing flow and shows the processing executed by the mail analysis device 2 in step S145. The mail analysis device 2 sets Max_tf·idf_Ratio to 0, sets Score(d, t1) to 0, and reads m (step S151); sets t_dic to the first word of the tf·idf dictionary 41 of the person in charge d (step S152); and checks whether t1 and t_dic partially match (step S153). If they do, the mail analysis device 2 sets tf·idf to the tf·idf value of t_dic, sets Matched_Ratio to the m-th power of the match rate between t1 and t_dic, calculates tf·idf_Ratio from tf·idf and Matched_Ratio (step S154), and checks whether tf·idf_Ratio is greater than Max_tf·idf_Ratio (step S155). If it is, the mail analysis device 2 sets Max_tf·idf_Ratio to tf·idf_Ratio (step S156) and checks whether all words of the tf·idf dictionary 41 of the person in charge d have been processed (step S157).

If not all words have been processed, t_dic is set to the next word of the tf·idf dictionary 41 of the person in charge d (step S158), and step S153 and the subsequent steps are repeated. If t1 and t_dic do not partially match in step S153, steps S154 to S156 are skipped and step S157 is executed. If tf·idf_Ratio is not greater than Max_tf·idf_Ratio in step S155, step S156 is skipped and step S157 is executed. If all words have been processed in step S157, Score(d, t1) is set to Max_tf·idf_Ratio and the processing ends (step S159).
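The flow of FIG. 10 amounts to taking, over the dictionary of person d, the maximum of the tf·idf value times the m-th power of the match rate. A minimal sketch follows. The text does not spell out the partial-match test, so this sketch assumes that two words partially match when one contains the other, with the match rate being the shorter length divided by the longer; it also assumes tf·idf_Ratio is the product of tf·idf and Matched_Ratio. The names `match_ratio` and `tfidf_word_score` and the sample dictionary are hypothetical.

```python
def match_ratio(t, t_dic):
    """ASSUMED partial match: containment, rated shorter length / longer length."""
    shorter, longer = sorted((t, t_dic), key=len)
    return len(shorter) / len(longer) if shorter and shorter in longer else 0.0


def tfidf_word_score(t1, tfidf_dict, m):
    """Score(d, t1) after step S159 for one mail word t1."""
    best = 0.0                                    # S151: Max_tf·idf_Ratio = 0
    for t_dic, weight in tfidf_dict.items():      # S152, S157, S158: scan the dictionary
        ratio = match_ratio(t1, t_dic)
        if ratio > 0:                             # S153: partial match?
            candidate = weight * ratio ** m       # S154: ASSUMED to be a product
            best = max(best, candidate)           # S155, S156: keep the maximum
    return best                                   # S159


# Hypothetical tf·idf dictionary 41 of one person in charge.
tfidf_dict_d = {"classification": 0.5, "class": 0.2}
score = tfidf_word_score("class", tfidf_dict_d, m=2)
```

Here "class" matches "classification" only partially (rate 5/14), so with m = 2 the exact match on "class" gives the larger weighted value even though its tf·idf value is smaller.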

FIG. 11 is the idf/conf-weighted addition processing flow and shows the processing executed by the mail analysis device 2 in step S146. The mail analysis device 2 sets Max_idf/conf_Ratio to 0 and reads m (step S161); sets t_dic1 to the first word of the first entry and t_dic2 to the second word of the first entry of the idf/conf dictionary 42 of the person in charge d (step S162); and checks whether t1 and t_dic1 partially match (step S163). If they do, the mail analysis device 2 sets t2 to the word following t1 in the mail 8 (step S164) and checks whether t2 and t_dic2 partially match (step S165). If they do, the mail analysis device 2 sets idf/conf to the idf/conf value of t_dic1 and t_dic2, sets Matched_Ratio1 to the m-th power of the match rate between t1 and t_dic1, sets Matched_Ratio2 to the m-th power of the match rate between t2 and t_dic2, and calculates idf/conf_Ratio from idf/conf, Matched_Ratio1, and Matched_Ratio2 (step S166).

If t2 and t_dic2 do not partially match in step S165, the mail analysis device 2 checks whether the last word of the mail 8 has been processed (step S1611); if not, it sets t2 to the word following t2 in the mail 8 (step S1612) and repeats step S165 and the subsequent steps.

After step S166, the mail analysis device 2 checks whether idf/conf_Ratio is greater than Max_idf/conf_Ratio (step S167); if it is, it sets Max_idf/conf_Ratio to idf/conf_Ratio (step S168) and checks whether all entries of the idf/conf dictionary 42 of the person in charge d have been processed (step S169). If not, the mail analysis device 2 sets t_dic1 to the first word and t_dic2 to the second word of the next entry of the idf/conf dictionary 42 of the person in charge d (step S1610) and repeats step S163 and the subsequent steps. If all entries have been processed, the mail analysis device 2 reads the constant α (step S1613) and calculates Score(d, t1) by adding to it the value obtained from α and Max_idf/conf_Ratio (step S1614).

If t1 and t_dic1 do not partially match in step S163, the mail analysis device 2 skips steps S164 to S168 (including S1611 and S1612) and executes step S169. If idf/conf_Ratio is not greater than Max_idf/conf_Ratio in step S167, the mail analysis device 2 skips step S168 and executes step S169. If the last word of the mail 8 has been processed in step S1611, the mail analysis device 2 skips step S1612 (and S166 to S168) and executes step S169.
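The flow of FIG. 11 can be sketched along the same lines. Several details are not fixed by the text and are assumed here: the partial-match test (containment, rated shorter length / longer length), idf/conf_Ratio as the product of the idf/conf value and the two match rates raised to the m-th power, and the contribution to Score(d, t1) as α times Max_idf/conf_Ratio. The names `match_ratio` and `pair_word_score` and the sample data are hypothetical.

```python
def match_ratio(t, t_dic):
    """ASSUMED partial match: containment, rated shorter length / longer length."""
    shorter, longer = sorted((t, t_dic), key=len)
    return len(shorter) / len(longer) if shorter and shorter in longer else 0.0


def pair_word_score(words, i, pair_dict, m, alpha):
    """Amount added to Score(d, t1) in step S1614 for the mail word t1 = words[i]."""
    t1 = words[i]
    best = 0.0                                        # S161: Max_idf/conf_Ratio = 0
    for (t_dic1, t_dic2), weight in pair_dict.items():    # S162, S169, S1610
        r1 = match_ratio(t1, t_dic1)
        if r1 == 0:                                   # S163: no match, go to next entry
            continue
        for t2 in words[i + 1:]:                      # S164, S1611, S1612: following words
            r2 = match_ratio(t2, t_dic2)
            if r2 > 0:                                # S165: first partial match found
                candidate = weight * r1 ** m * r2 ** m    # S166: ASSUMED product form
                best = max(best, candidate)           # S167, S168
                break
    return alpha * best                               # S1613, S1614: ASSUMED alpha * max


# Hypothetical mail words and idf/conf dictionary 42 of one person in charge.
words = ["pump", "noise", "valves"]
pair_dict_d = {("pump", "valve"): 2.0}
added = pair_word_score(words, 0, pair_dict_d, m=1, alpha=0.5)
```

In the sample, "pump" matches the first dictionary word exactly, "noise" does not match "valve", and "valves" matches it at a rate of 5/6, so the pair contributes 0.5 × 2.0 × 5/6.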

In this way, the present invention collates every word appearing in the mail 8 with the dictionary 4 and calculates the score of each person in charge for the mail 8. In doing so, the lengths of the words produced by morphological analysis may vary, and there is no guarantee that the inquirer and the person in charge use the same words. If the words in the mail 8 were collated with the words in the dictionary 4 on the condition of an exact match, few words would match, and the classification accuracy would not improve. The present invention therefore collates words by partial match and performs weighted addition in which the word weight is multiplied by the match rate, to calculate the score of the person in charge d for the mail 8.

First, the score of a word t in the mail 8, combining its tf·idf value and idf/conf value, is defined by equation (3). Because of the partial matching described above, both the word t and a word t' co-occurring with it may match several words or word combinations in the dictionary 4. Among these, the ones for which the tf·idf value and the idf/conf value, each multiplied by the match rate, are largest are added.

Here, Match_Ratio(t) is the word match rate, that is, the ratio of the number of matched characters to the word lengths of the word t in the mail 8 and the word in the dictionary 4, and is defined by equation (4).

Here, t_dic is a word in the dictionary 4, Length(t) is the number of characters in the word t, and Match_Length(t, t_dic) is the number of characters that match between t and t_dic. In equation (3), α is a coefficient that determines the relative weight of the first and second terms. In equation (4), m determines the order to which the word match rate is reflected in the score. Since the match rate is at most 1, the larger m is, the greater the difference in score the match rate produces. Both α and m can be varied over a suitable range, and the values that give the highest classification accuracy can be adopted empirically.

Finally, the score of the person in charge d for the input mail 8 is defined by equation (5), by summing the word score Score_t^d given by equation (3) over the words appearing in the body of the mail 8.

Here, the respondents to the input mail 8 are estimated to be the persons in charge whose score exceeds the mean score by 2σ or more (σ being the standard deviation), assuming that the distribution of Score^d over all persons in charge is normal. This is because actual mails 8, when the content of inquiries is not restricted, often ask about several matters at once or span several specialist fields, so that several persons in charge need to cooperate on the reply.
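The selection rule above can be sketched with the standard library. The text does not say whether the population or the sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`). The function name `select_respondents` and the scores are hypothetical.

```python
from statistics import mean, pstdev


def select_respondents(scores):
    """Return persons whose Score(d) is at least mean + 2 * sigma."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)          # ASSUMED: population standard deviation
    threshold = mu + 2 * sigma
    return [d for d, s in scores.items() if s >= threshold]


# Hypothetical scores for six persons in charge.
scores = {"d1": 10.0, "d2": 1.0, "d3": 1.0, "d4": 1.0, "d5": 1.0, "d6": 1.0}
respondents = select_respondents(scores)
```

Because the rule is a threshold rather than an arg-max, it can return several persons in charge, or none, which fits the stated aim of routing a multi-topic inquiry to more than one respondent.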

FIG. 12 is the reinforcement learning processing flow and mainly shows the reinforcement learning of word weights that the reinforcement learning device 3 of FIG. 1 executes in step S102. The reinforcement learning device 3 reads a separate document file 7 (step S171) and extracts the set of true respondents (step S172). The separate document file 7 includes, for example, mail 8 for learning (the same applies below). The mail analysis device 2 performs the mail analysis described above on the separate document file 7 and obtains the result of FIG. 3. The reinforcement learning device 3 acquires this result as a mail analysis result table (step S173). For every respondent in the mail analysis result table, it checks whether that respondent is included in the set of true respondents (step S174); if so, it executes the important word learning process for that respondent (step S175), and if not, it executes the unnecessary word learning process for that respondent (step S176). In practice, as indicated by the dotted line, steps S174 and S175 or S176 are executed for one respondent and then repeated for each respondent. Next, for every non-respondent in the mail analysis result table, the reinforcement learning device 3 checks whether that non-respondent is included in the set of true respondents (step S177); if so, it executes the specific word learning process for that non-respondent (step S178), and if not, step S178 is skipped. In practice, as indicated by the dotted line, step S177, or steps S177 and S178, are executed for one non-respondent and then repeated for each non-respondent. Finally, the reinforcement learning device 3 checks whether all separate document files 7 have been processed (step S179); if not, it repeats the processing from step S171, and if so, it ends the processing.
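The branching in steps S174 to S178 can be summarized in a few lines. The sketch below is illustrative only; the function name `choose_learning` and the string labels are hypothetical and do not appear in the original.

```python
def choose_learning(person, estimated, true_respondents):
    """Select the learning process of FIG. 12 for one person in charge.

    estimated:        persons the system estimated as respondents
    true_respondents: the set of true respondents (step S172)
    """
    in_true = person in true_respondents
    if person in estimated:
        # Steps S174 to S176: correct estimate or false positive
        return "important" if in_true else "unnecessary"
    # Steps S177 and S178: a missed true respondent; otherwise nothing to learn
    return "specific" if in_true else None

estimated = {"A", "B"}
true_respondents = {"A", "C"}
for person in ["A", "B", "C", "D"]:
    print(person, choose_learning(person, estimated, true_respondents))
```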

In the present invention, on receiving the analysis result for the separate document file 7, the tf·idf values and idf/conf values in the dictionary 4 of each person in charge are updated by PS according to whether the estimated respondents were correct. In this example, a separate document file 7 whose creator, i.e., the person in charge, is known is input to the mail analysis device 2. Because the true creator of the document is known, it can be determined whether the classification result of the system is correct. Alternatively, the result of a human classifier reading and classifying the separate document file 7 may be treated as correct and used for learning.

In the present invention, the following three types of reinforcement learning are performed. Important word learning applies when the system correctly estimated a true respondent as a respondent: the words matched between the separate document file 7 and that respondent's dictionary 40 are treated as important words and their weights are increased. Specific word learning applies when the system failed to estimate a true respondent as a respondent: among the words that appeared in the separate document file 7, words highly specific to the true respondent are re-evaluated, and the weights of that respondent's specific words are increased. Unnecessary word learning applies when the system wrongly estimated someone other than a true respondent as a respondent: the words matched between the separate document file 7 and that person's dictionary 40 are treated as unnecessary words, and their weights in that person's dictionary 40 are decreased. At the same time, these words are likewise learned as unnecessary words in the dictionaries 40 of the other persons in charge who are not true respondents.

FIG. 13 is the important word learning processing flow and shows the important word learning that the reinforcement learning device 3 executes in step S175. The reinforcement learning device 3 reads the constants c1, L, and S from the definition file (step S181) and sets the word t to the first-rank matched word of the target respondent in the mail analysis result table (step S182). Next, it looks up t in the target respondent's tf·idf dictionary 41, sets tf·idf to the tf·idf value of the word matching t, and calculates f_tf·idf from tf·idf and c1 (step S183). It then looks up t among the first words of the target respondent's idf/conf dictionary 42, sets idf/conf to the idf/conf value of the word matching t, and calculates f_idf/conf from idf/conf and c1 (step S184). Finally, it updates the tf·idf value and idf/conf value of t in the tf·idf dictionary 41 and the idf/conf dictionary 42 (step S185). That is, the new tf·idf value is the previous tf·idf value plus f_tf·idf, and the new idf/conf value is the previous idf/conf value plus f_idf/conf.

Next, the reinforcement learning device 3 sets i = 2 (step S186) and sets the word t to the i-th rank matched word of the target respondent in the mail analysis result table (step S187). It looks up t in the target respondent's tf·idf dictionary 41, sets tf·idf to the tf·idf value of the word matching t, and calculates a new f_tf·idf from f_tf·idf and S (step S188). It then looks up t among the first words of the target respondent's idf/conf dictionary 42, sets idf/conf to the idf/conf value of the word matching t, and calculates a new f_idf/conf from f_idf/conf and S (step S189). It updates the tf·idf value and idf/conf value of t in the tf·idf dictionary 41 and the idf/conf dictionary 42 (step S1810). That is, the new tf·idf value is the previous tf·idf value plus f_tf·idf, and the new idf/conf value is the previous idf/conf value plus f_idf/conf. The reinforcement learning device 3 then checks whether i equals L (step S1811). If not, it sets i to i + 1 (step S1812) and repeats the processing from step S187. If so, it skips step S1812 and ends the processing. Note that when the text of the mail 8 is extremely short, the maximum value of i may be smaller than L. In this case, that maximum value of i is used in place of the L that was read. That is, the processing ends when i equals that maximum value in step S1811 (the same applies to steps S1911 and S2120 below).
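The loop of FIG. 13 can be sketched as below. Several points are assumptions made for illustration: the first reinforcement value is taken as `c1` times the current weight (the text only says it is calculated "from tf·idf and c1"), each later value is the previous one divided by `S`, the idf/conf dictionary is simplified to a map keyed by the first word only, and every matched word is assumed to be present in both dictionaries. The function name is hypothetical.

```python
def learn_important_words(matched_words, tfidf, idfconf, c1, L, S):
    """Reward the top matched words of a correctly estimated respondent.

    matched_words: words ordered by contribution to the score, rank 1 first
    tfidf, idfconf: {word: weight} dictionaries, updated in place
    """
    n = min(L, len(matched_words))  # a short mail may match fewer than L words
    f_tfidf = f_idfconf = 0.0
    for i, t in enumerate(matched_words[:n], start=1):
        if i == 1:
            # Assumed form of the initial reinforcement value
            f_tfidf = c1 * tfidf[t]
            f_idfconf = c1 * idfconf[t]
        else:
            # Assumed geometric decrease with ratio 1/S
            f_tfidf /= S
            f_idfconf /= S
        tfidf[t] += f_tfidf        # steps S185 and S1810
        idfconf[t] += f_idfconf

tfidf = {"a": 2.0, "b": 1.0, "c": 8.0}
idfconf = {"a": 4.0, "b": 1.0, "c": 8.0}
learn_important_words(["a", "b"], tfidf, idfconf, c1=0.5, L=3, S=4)
print(tfidf, idfconf)
```

Here only two words were matched although L = 3, so the loop stops after the second word, as the note on short mail describes.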

In PS, the weights attached to rules are reinforced episode by episode. An episode is the sequence of rules from the initial state, or from immediately after a reward is obtained, up to the next reward. The reinforcement uses a reinforcement function whose argument is how far back from the reward a rule lies. For an episode of length l, (r_l, ..., r_i, ..., r_2, r_1), the weight w_i of rule r_i is reinforced by the reinforcement function f_i as in expression (6).

In the present invention, a rule corresponds to a word or a combination of words in the dictionary 4, and the rule sequence of an episode is the words (t_l, ..., t_i, ..., t_2, t_1) that determined the score Score^d in expression (5), that is, the words matched between the separate document file 7 and the dictionary 40 of the person in charge d. The order i in the sequence is the order of contribution to the score Score^d of expression (5), i.e., the descending order of Score_t^d in expression (3), and the top-scoring word t for which Score_t^d is largest (the first-rank matched word) is t_1. Further, because the present invention represents the weight of a word by two values, the tf·idf value and the idf/conf value, expression (6) becomes expressions (7) and (8), respectively. When a respondent estimated by the system coincides with a true respondent, a positive reward is given to each word t_i of the person in charge d, and the reward is larger the higher the rank i.

In the learning method of the present invention, the tf·idf value and the idf/conf value are reinforced separately, but in exactly the same manner (see the processing in FIG. 13).

In PS, it must be guaranteed that the weights of effective words are reinforced and the weights of ineffective words are suppressed. As is well known, the rationality theorem proves that this condition amounts to satisfying expression (9).

Here, W is the maximum episode length and L is the maximum number of effective rules that exist under the same sensory input. In the present invention, the number of words for which learning takes effect is limited: a setting determines how many of the top-scoring words, among the words matched between the separate document file 7 and the dictionary 4, are learned, and that number is L (for example, if L = 10, the top 10 words are learned; the same applies below). As the simplest reinforcement function that satisfies the theorem of expression (9), the geometrically decreasing function of expression (10) is used.

Here, as is well known, S ≥ L + 1 must be satisfied. The initial value f_1^d of expression (10), i.e., the reinforcement value of the word with the largest score Score_t^d, is given by expression (11) below, where c1 is a constant.
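The requirement S ≥ L + 1 can be checked numerically. Expressions (9) and (10) are not reproduced in this text, so the sketch assumes the forms commonly stated for profit sharing: a geometric function f_i = f_{i-1} / S, and the rationality condition L · Σ_{j=i}^{W} f_j < f_{i-1}, checked here for i = 2, ..., W. Both forms, and the function names, are assumptions. Exact rational arithmetic is used so that the strict inequality is not blurred by floating-point rounding.

```python
from fractions import Fraction

def geometric_reinforcement(f1, S, W):
    """Return [f_1, ..., f_W] with f_i = f_{i-1} / S (assumed form of (10))."""
    values = [Fraction(f1)]
    for _ in range(W - 1):
        values.append(values[-1] / S)
    return values

def satisfies_rationality(f, L):
    """Check L * sum_{j=i}^{W} f_j < f_{i-1} for i = 2..W (assumed form of (9)).

    The list f is 0-indexed, so f[k] holds f_{k+1}.
    """
    return all(L * sum(f[k:]) < f[k - 1] for k in range(1, len(f)))

L = 10
print(satisfies_rationality(geometric_reinforcement(1, S=L + 1, W=10), L))
print(satisfies_rationality(geometric_reinforcement(1, S=L, W=10), L))
```

Under these assumptions, S = L + 1 passes the check and S = L fails it, which is consistent with the stated bound.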

Thus, important word learning is performed when an estimated respondent was a true respondent: a positive reward is given to the weights of the words with high scores Score_t^d, and those weights are increased.

FIG. 14 is the unnecessary word learning processing flow and shows the unnecessary word learning that the reinforcement learning device 3 executes in step S176. The reinforcement learning device 3 reads the constants c2, L, and S from the definition file (step S191) and sets the word t to the first-rank matched word of the target respondent in the mail analysis result table (step S192). Next, it looks up t in the target respondent's tf·idf dictionary 41, sets tf·idf to the tf·idf value of the word matching t, and calculates f_tf·idf from tf·idf and c2 (step S193). It then looks up t among the first words of the target respondent's idf/conf dictionary 42, sets idf/conf to the idf/conf value of the word matching t, and calculates f_idf/conf from idf/conf and c2 (step S194). Finally, it executes the unnecessary word reinforcement process (step S195), which is described later with reference to FIG. 15.

Next, the reinforcement learning device 3 sets i = 2 (step S196) and sets the word t to the i-th rank matched word of the target respondent in the mail analysis result table (step S197). It looks up t in the target respondent's tf·idf dictionary 41, sets tf·idf to the tf·idf value of the word matching t, and calculates a new f_tf·idf from f_tf·idf and S (step S198). It then looks up t among the first words of the target respondent's idf/conf dictionary 42, sets idf/conf to the idf/conf value of the word matching t, and calculates a new f_idf/conf from f_idf/conf and S (step S199). It executes the same unnecessary word reinforcement process as in step S195 (step S1910). The reinforcement learning device 3 then checks whether i equals L (step S1911). If not, it sets i to i + 1 (step S1912) and repeats the processing from step S197. If so, it skips step S1912 and ends the processing.

FIG. 15 is the unnecessary word reinforcement processing flow and shows the unnecessary word reinforcement that the reinforcement learning device 3 executes in steps S195 and S1910. The reinforcement learning device 3 sets the person in charge d to the target respondent (step S201) and checks whether tf·idf − f_tf·idf is greater than 0 (step S202). If it is, the tf·idf value of t in the tf·idf dictionary 41 is updated to tf·idf − f_tf·idf (step S203); if not, it is updated to 0 (step S204). Next, the reinforcement learning device 3 checks whether idf/conf − f_idf/conf is greater than 0 (step S205). If it is, the idf/conf value of t in the idf/conf dictionary 42 is updated to idf/conf − f_idf/conf (step S206); if not, it is updated to 0 (step S207). The reinforcement learning device 3 then checks whether all persons in charge other than the set of true respondents have been processed (step S208). If not, it sets the person in charge d to the next person in charge (target respondent) (step S209), looks up t in the tf·idf dictionary 41 of the person in charge d and sets tf·idf to the tf·idf value of the word matching t (step S2010), looks up t among the first words of the idf/conf dictionary 42 of the person in charge d and sets idf/conf to the idf/conf value of the word matching t (step S2011), and repeats the processing from step S202. If all persons in charge have been processed in step S208, the processing ends.
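The clamped subtraction of FIG. 15, applied first to the wrongly estimated respondent and then to every other person in charge outside the set of true respondents, can be sketched as follows. The nested dictionary layout and the function name are hypothetical; skipping a dictionary that does not contain the word follows the statement that the penalty applies where the word is registered.

```python
def penalize_unnecessary_word(t, f_tfidf, f_idfconf, dictionaries,
                              target, true_respondents):
    """Subtract the reinforcement values from word t, never going below zero.

    dictionaries: {person: {"tfidf": {word: w}, "idfconf": {word: w}}}
    target:       the wrongly estimated respondent (step S201)
    """
    others = [d for d in dictionaries
              if d != target and d not in true_respondents]
    for d in [target] + others:          # loop of steps S208 to S2011
        for kind, f in (("tfidf", f_tfidf), ("idfconf", f_idfconf)):
            weights = dictionaries[d][kind]
            if t in weights:             # only where the word is registered
                weights[t] = max(weights[t] - f, 0.0)  # steps S202 to S207

dicts = {
    "A": {"tfidf": {"hello": 1.0}, "idfconf": {"hello": 0.5}},
    "B": {"tfidf": {"hello": 2.0}, "idfconf": {"hello": 2.0}},
    "C": {"tfidf": {"hello": 0.125}, "idfconf": {}},
}
penalize_unnecessary_word("hello", 0.25, 1.0, dicts,
                          target="A", true_respondents={"B"})
print(dicts)
```

In this example the true respondent B keeps its weights, while A and C are both penalized with the values computed for A.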

Unnecessary word learning is performed when the system wrongly estimated someone other than a true respondent as a respondent. In this case, the top words of the estimated respondent are in fact unnecessary words that have presumably been given large weights. A negative reward is therefore given to the weights of such words, and the weights are decreased according to expression (12).

Note that expression (12) prevents a weight from becoming negative. The geometric function for the reinforcement value is the same as in expression (10), and its initial value is given by expression (13).

Here, c2 is a constant. Empirically, the great majority of such unnecessary words are general terms, and the learning efficiency for unnecessary words improves considerably when a negative reward is given not only in the dictionary of the person the system estimated as a respondent but also in the dictionaries 40 of the persons other than the true respondents. In this case, therefore, for each top word of a respondent the system estimated wrongly, a negative reward is likewise given, and the weight decreased, wherever that word is registered in the dictionary 40 of another person in charge who is not a true respondent.

FIG. 16 is the specific word learning processing flow and shows the specific word learning that the reinforcement learning device 3 executes in step S178. The reinforcement learning device 3 reads the constants c3, L, S, min_idf, and max_ave_ratio from the definition file (step S211) and executes the idf value re-evaluation process (step S212), which is described later with reference to FIG. 17. Next, it sets the word t to the first-rank matched word of the target non-respondent in the mail analysis result table (step S213), looks up t in the target non-respondent's tf·idf dictionary 41 and sets tf·idf to the tf·idf value of the word matching t (step S214), sets max_tf·idf to the largest tf·idf value in the target non-respondent's tf·idf dictionary 41 (step S215), and checks whether tf·idf is greater than or equal to the product of max_tf·idf and max_ave_ratio (step S216). Here, max_ave_ratio represents the average weight normalized by the maximum tf·idf and idf/conf values of all persons in charge; the product of max_ave_ratio and max_tf·idf is w_p^d in expression (14), described later, and represents a weight large enough to affect the score of the target non-respondent. The method of calculating max_ave_ratio is described later (with FIG. 18). If the value is not greater than or equal to the product of max_tf·idf and max_ave_ratio, f_tf·idf is set to that product (step S217); if it is, f_tf·idf is set to a value obtained from tf·idf and c3 (step S218).

After this, the reinforcement learning device 3 looks up t among the first words of the target non-respondent's idf/conf dictionary 42 and sets idf/conf to the idf/conf value of the word matching t (step S219), sets max_idf/conf to the largest idf/conf value in the target non-respondent's idf/conf dictionary 42 (step S2110), and checks whether idf/conf is greater than or equal to the product of max_idf/conf and max_ave_ratio (step S2111). The constant max_ave_ratio is as described above for FIG. 16. If the value is not greater than or equal to the product of max_idf/conf and max_ave_ratio, f_idf/conf is set to that product (step S2112); if it is, f_idf/conf is set to a value obtained from idf/conf and c3 (step S2113).
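The choice of the first reinforcement value in steps S216 to S218 (and likewise S2111 to S2113) can be sketched as a single function. The sketch compares the current weight with the threshold, as in step S216, and assumes that the "value obtained from the weight and c3" is the product `c3 * w`; that product form and the function name are assumptions.

```python
def initial_specific_reinforcement(w, w_max, max_ave_ratio, c3):
    """Return the reinforcement value for the first-rank word.

    w:      current weight of the word (tf·idf or idf/conf value)
    w_max:  largest weight in the same dictionary of this person
    """
    w_p = w_max * max_ave_ratio   # threshold weight that can affect the score
    if w >= w_p:
        return c3 * w             # assumed form: constant multiple of the weight
    return w_p                    # low-ranked word: jump by w_p in one step

print(initial_specific_reinforcement(0.001, 8.0, 0.25, 0.5))
print(initial_specific_reinforcement(6.0, 8.0, 0.25, 0.5))
```

The same function serves both dictionaries, since the text applies identical tests to the tf·idf value and the idf/conf value.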

After this, the reinforcement learning device 3 updates the tf·idf value and idf/conf value of t in the tf·idf dictionary 41 and the idf/conf dictionary 42 (step S2114). That is, the new tf·idf value is the previous tf·idf value plus f_tf·idf, and the new idf/conf value is the previous idf/conf value plus f_idf/conf. The reinforcement learning device 3 then sets i = 2 (step S2115) and sets the word t to the i-th rank matched word of the target respondent in the mail analysis result table (step S2116). It looks up t in the target respondent's tf·idf dictionary 41, sets tf·idf to the tf·idf value of the word matching t, and calculates a new f_tf·idf from f_tf·idf and S (step S2117). It then looks up t among the first words of the target respondent's idf/conf dictionary 42, sets idf/conf to the idf/conf value of the word matching t, and calculates a new f_idf/conf from f_idf/conf and S (step S2118). It updates the tf·idf value and idf/conf value of t in the tf·idf dictionary 41 and the idf/conf dictionary 42 (step S2119), again by adding f_tf·idf to the previous tf·idf value and f_idf/conf to the previous idf/conf value. The reinforcement learning device 3 then checks whether i equals L (step S2120). If not, it sets i to i + 1 (step S2121) and repeats the processing from step S2116. If so, it skips step S2121 and ends the processing.

FIG. 17 is the idf value re-evaluation processing flow and shows the idf value re-evaluation that the reinforcement learning device 3 executes in step S212. The reinforcement learning device 3 calculates idf values over the set of words ranked up to L-th for all non-respondents in the mail analysis result table (step S221), rewrites Score(d, t) of the target non-respondent's words ranked up to L-th in the mail analysis result table to these idf values (step S222), deletes from the mail analysis result table the target non-respondent's words whose idf value is less than min_idf and the words ranked (L + 1)-th or lower (step S223), and sorts the target non-respondent's words in the mail analysis result table in descending order of idf value (step S224). In step S223, if there is no word with an idf value less than min_idf and no word ranked (L + 1)-th or lower, there may be nothing to delete. Deleting words in step S223 may also make the maximum value of i smaller than L. In this case, that maximum value of i is used in place of the L that was read. That is, the processing ends when i equals that maximum value in step S2120.
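The re-evaluation of FIG. 17 can be sketched as follows. The idf formula used in this patent is defined outside this passage, so the sketch substitutes the common form log(N / df), where N is the number of non-respondents and df is the number of them whose top-L words include the word. That formula, the data layout, and the function name are assumptions for illustration.

```python
import math

def reevaluate_by_idf(top_words, target, L, min_idf):
    """Re-rank the target non-respondent's top-L words by idf value.

    top_words: {non_respondent: [words ordered by score, best first]}
    Returns [(word, idf)] sorted by descending idf, dropping idf < min_idf.
    """
    n = len(top_words)
    word_sets = [set(words[:L]) for words in top_words.values()]

    def idf(t):
        df = sum(t in s for s in word_sets)  # df >= 1: the target itself has t
        return math.log(n / df)              # assumed idf formula

    scored = [(t, idf(t)) for t in top_words[target][:L]]   # steps S221, S222
    kept = [(t, v) for t, v in scored if v >= min_idf]      # step S223
    return sorted(kept, key=lambda tv: tv[1], reverse=True) # step S224

top = {"A": ["x", "y", "z"], "B": ["x", "q"], "C": ["x", "y"]}
result = reevaluate_by_idf(top, "A", L=3, min_idf=0.1)
print(result)
```

In this example the word shared by every non-respondent is dropped, and the word unique to the target moves to first rank.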

Specific word learning is performed when the system failed to estimate a true respondent as a respondent. In this case, the weights of important words in the true respondent's dictionary 40 may be smaller than they should be. This can occur, for example, when the true respondent habitually uses one technical term and rarely uses its synonyms or similar words, so that the tf values of those synonyms and similar words become extremely small. The top-scoring words of the true respondent are therefore compared with the top words of the other persons in charge, and the specificity of each word is re-evaluated in the same way as an idf value is calculated. That is, if a top word of a true respondent who was not estimated as a respondent is found to be specific, it is given a positive reward as a specific word and its weight is increased. The reinforcement method is the same as expression (10) for the case where the true respondent was estimated correctly, except that the word sequence is ordered by idf value, and the reinforcement value of the word with the largest idf value is given by expression (14) instead of expression (11).

Here, c3 is a constant, as in expression (11). For all words in the dictionary 4, the relationship between normalized rank and normalized weight is a curve such as that shown in FIG. 18. The horizontal axis of FIG. 18 is the rank of each word divided by the maximum rank of the words of the person in charge (i.e., the number of registered words), and the vertical axis is likewise the weight of each word divided by the maximum weight. FIG. 18 plots, for each rank on the horizontal axis, the average weight over all persons in charge (the average of both the tf·idf values and the idf/conf values). Assuming the average slope of this curve to be −1, the weight at the point where the curve touches a straight line of slope −1 is taken as max_ave_ratio in FIG. 16. w_p^d in expression (14) is the product of this max_ave_ratio and the maximum weight. When the system fails to estimate (classify) a true respondent as a respondent, the words ranking high in that true respondent's score Score_t^d are likely to rank low in that person's dictionary 40. The weights of such low-ranked words are many orders of magnitude smaller than those of high-ranked words, so a reinforcement value given as a constant multiple of the original weight is too small to affect the overall score, and learning has no visible effect. Therefore, if the weight w_1^d of the top-scoring word is smaller than w_p^d, it is raised to w_p^d in a single step. According to the inventors' study, using the simple average of the weights as w_p^d yields a small value, because words with small weights are very numerous, and the learning effect does not improve.
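One way to locate the tangent point that defines max_ave_ratio is sketched below. It assumes the averaged curve of FIG. 18 is available as sampled points and is decreasing and convex; under that assumption, the point touched by a line of slope −1 is the sample that minimizes x + y. The sampling, the convexity assumption, and the function name are illustrative and are not stated in the original.

```python
def max_ave_ratio(points):
    """Return the normalized weight at the slope -1 tangent point.

    points: [(normalized_rank, normalized_average_weight), ...]
    For a convex decreasing curve, a line x + y = c of slope -1 touches
    the curve where x + y is smallest.
    """
    _, y = min(points, key=lambda p: p[0] + p[1])
    return y

# Hypothetical curve y = 0.01 / x sampled at x = 0.01, 0.02, ..., 1.00
curve = [(k / 100, 1 / k) for k in range(1, 101)]
print(max_ave_ratio(curve))
```

For the hypothetical curve above, the slope equals −1 at x = 0.1, so the value found is the weight at that rank.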

In this example, word matching during learning is exact matching (so in FIGS. 13 to 17 a word is always matched). This is because the inventors' study confirmed that, when the separate document file 7 is used for learning, the learning effect is obtained even without the partial matching used in mail analysis. When mail 8 is used for learning, partial matching may be used, with the reinforcement value multiplied by the match rate of the word.

The present invention has been described above according to its embodiment, but various modifications are possible within the scope of its gist. For example, the document classification system of the present invention is not limited to mail 8 and can be used widely for document classification. The documents to be classified may include, for example, home pages. They may also include, for example, electronic data obtained by speech recognition of voice input. The document classification system of the present invention can classify any electronic data composed of natural language (including symbols). The classification destination is not limited to individual persons in charge and may be a department in charge within any of various organizations. The document classification system of the present invention need not include all three subsystems. That is, the dictionary editing device 1, the document classification device 2, and the reinforcement learning device 3 may each be provided independently, the document classification device 2 alone may be provided, or the dictionary editing device 1 or the reinforcement learning device 3 may be provided together with the document classification device 2.

As described above, according to the present invention, the dictionary editing device can create a tf·idf dictionary and an idf/conf dictionary by using the tf·idf value and the idf/conf value as two separate, independent parameters. When these are used as the dictionary of a document classification device for classifying documents such as e-mails, practical classification accuracy can be obtained, and a dictionary for automatically and accurately classifying documents to the appropriate person in charge can be created easily.

Further, according to the present invention, by using the two dictionaries described above (the tf·idf dictionary and the idf/conf dictionary), the document classification device can achieve practical classification accuracy when classifying documents such as e-mails, and can automatically and accurately classify documents to the appropriate person in charge.

Further, according to the present invention, the document classification system program can be supplied by storing it on a medium such as a flexible disk, CD-ROM, CD-R/W, or DVD, or by downloading it via a network such as the Internet, so that the document classification device described above can be realized easily and accurate document classification becomes possible.

FIG. 1 is a configuration diagram of the document classification system.
FIG. 2 is an explanatory diagram of the dictionary.
FIG. 3 is an explanatory diagram of a document classification result (mail analysis result).
FIG. 4 is a document classification processing flow.
FIG. 5 is a dictionary editing processing flow.
FIG. 6 is a person-in-charge/word table generation processing flow.
FIG. 7 is a word combination table generation processing flow.
FIG. 8 is an explanatory diagram of the word combination table.
FIG. 9 is an inquiry mail analysis processing flow.
FIG. 10 is a tf·idf value weighted addition processing flow.
FIG. 11 is an idf/conf value weighted addition processing flow.
FIG. 12 is a reinforcement learning processing flow.
FIG. 13 is an important word learning processing flow.
FIG. 14 is an unnecessary word learning processing flow.
FIG. 15 is an unnecessary word reinforcement processing flow.
FIG. 16 is a specific word learning processing flow.
FIG. 17 is an idf value re-evaluation processing flow.
FIG. 18 is an explanatory diagram of the relationship between rank and weight.

Explanation of symbols

1 Dictionary editing device
2 Document classification device (mail analysis device)
3 Reinforcement learning device
4 Dictionary
40 Category dictionary
41 Importance dictionary (tf·idf dictionary)
42 Co-occurrence dictionary (idf/conf dictionary)

Claims

1. A dictionary editing device that creates a dictionary based on documents that each belong to a category, wherein the dictionary editing device:
creates an importance dictionary that stores, for each word that appears in the category, a tf·idf value representing the importance of the word alone; and
creates a co-occurrence dictionary that stores, for each combination of a plurality of words that appear in the category, an idf/conf value representing co-occurrence between the words.

2. The dictionary editing device according to claim 1, wherein the co-occurrence dictionary stores, for each combination of a first word and a second word that appear in the category, an idf/conf value representing co-occurrence between the two words.

3. A document classification device comprising a dictionary that consists of a plurality of category dictionaries, one provided for each of a plurality of categories, each category dictionary comprising an importance dictionary that stores, for each word that appears in the category, a tf·idf value representing the importance of the word alone, and a co-occurrence dictionary that stores, for each combination of a first word and a second word that appear in the category, an idf/conf value representing co-occurrence between the two words,
wherein the document classification device collates the words that appear in an input document against the dictionary to obtain the tf·idf value and the idf/conf value for each word in the dictionary, calculates a score for each word by performing a predetermined calculation based on these values, calculates a score for each category based on the score for each word, and classifies the input document into one of the plurality of categories based on the score for each category.

4. The document classification device according to claim 3, wherein the document is an e-mail, the category is a person in charge to whom the document is to be distributed, and the document classification device distributes the input document to the person in charge into whose category it has been classified.

5. The document classification device according to claim 3, wherein the document classification device classifies the input document into a predetermined number of the highest-scoring categories in accordance with the score for each category.

6. The document classification device according to claim 3, wherein the document classification device calculates the score for each word as the product of the tf·idf value for the word in the dictionary, the idf/conf value, and the match rate between the word in the dictionary and the word appearing in the input document.
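The scoring described in the claims above can be sketched as follows. The word score is the product of the tf·idf value, the idf/conf value, and the match rate, as recited. Summing word scores to obtain the category score is an assumption made for this sketch (the claims say only that the category score is based on the word scores), and the names `Match`, `word_score`, `category_scores`, and `top_categories` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """One dictionary word matched against a word of the input document."""
    tf_idf: float      # importance of the word alone
    idf_conf: float    # co-occurrence value for the word
    match_rate: float  # 1.0 for an exact match, lower for a partial match


def word_score(m: Match) -> float:
    # Product of the three values, as recited for the word score.
    return m.tf_idf * m.idf_conf * m.match_rate


def category_scores(matches_by_category: dict[str, list[Match]]) -> dict[str, float]:
    # Assumption: the category score is the plain sum of its word scores.
    return {cat: sum(word_score(m) for m in ms)
            for cat, ms in matches_by_category.items()}


def top_categories(scores: dict[str, float], n: int) -> list[str]:
    # Classify into the n highest-scoring categories.
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up values for three hypothetical persons in charge.
matches = {
    "A": [Match(2.0, 0.5, 1.0), Match(4.0, 0.5, 0.5)],
    "B": [Match(1.0, 1.0, 0.5)],
    "C": [],
}
scores = category_scores(matches)
ranked = top_categories(scores, 2)
```

Keeping the word score separate from the category aggregation makes it straightforward to substitute a different aggregation, such as a rank-weighted sum, without changing the per-word product.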

7. The document classification device according to claim 3, further comprising a learning device that, based on the result of the document classification device classifying a predetermined document into one of the plurality of categories, updates the tf·idf values in the importance dictionary and the idf/conf values in the co-occurrence dictionary.

8. The document classification device according to claim 7, wherein the learning device:
when the predetermined document is classified into its true category, treats each word that appears in the predetermined document and is matched in the dictionary of the true category as an important word and increases its weight;
when the predetermined document is not classified into its true category, re-evaluates those words appearing in the predetermined document that are highly specific to the true category, thereby increasing the weight of the specific words of the true category; and
when the predetermined document is classified into a category other than its true category, treats each word that appears in the predetermined document and is matched in the dictionary of the incorrectly selected category as an unnecessary word, reduces its weight in the dictionary of the incorrectly selected category, and reduces its weight in the dictionaries of the categories other than the true category.
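The three learning cases recited above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function name `learn`, the single per-word weight per category, the multiplicative factors `up` and `down`, and the externally supplied set `specific_words` (the words judged highly specific to the true category) are all hypothetical.

```python
def learn(weights, doc_words, true_cat, predicted, specific_words,
          up=1.2, down=0.8):
    """Update per-category word weights in place and return them.

    weights:        {category: {word: weight}}
    doc_words:      set of words appearing in the document
    true_cat:       the document's true category
    predicted:      set of categories the document was classified into
    specific_words: words assumed to be highly specific to true_cat
    """
    if true_cat in predicted:
        # Case 1: matched words in the true category are important words.
        for w in doc_words:
            if w in weights[true_cat]:
                weights[true_cat][w] *= up
    else:
        # Case 2: strengthen the specific words of the true category.
        for w in specific_words:
            if w in doc_words and w in weights[true_cat]:
                weights[true_cat][w] *= up

    # Case 3: words matched in an incorrectly selected category are
    # unnecessary words; collect them first so each is reduced only once.
    unnecessary = set()
    for cat in predicted - {true_cat}:
        unnecessary |= {w for w in doc_words if w in weights[cat]}
    for cat, table in weights.items():
        if cat == true_cat:
            continue  # weights in the true category are not reduced
        for w in unnecessary:
            if w in table:
                table[w] *= down
    return weights


# Made-up dictionaries: the document belongs to "A" but was sent to "B".
weights = {
    "A": {"tax": 1.0, "form": 1.0},
    "B": {"tax": 1.0, "road": 1.0},
    "C": {"tax": 1.0},
}
learn(weights, {"tax", "form"}, "A", {"B"}, {"form"}, up=2.0, down=0.5)
```

In this example the shared word "tax" loses weight in every category except the true one, while "form", assumed specific to category "A", gains weight there.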

9. A document classification program for realizing a document classification device, the program causing a computer to execute:
a process of collating the words that appear in an input document against a dictionary to obtain a tf·idf value and an idf/conf value for each word in the dictionary, the dictionary consisting of a plurality of category dictionaries, one provided for each of a plurality of categories, each category dictionary comprising an importance dictionary that stores, for each word that appears in the category, a tf·idf value representing the importance of the word alone, and a co-occurrence dictionary that stores, for each combination of a first word and a second word that appear in the category, an idf/conf value representing co-occurrence between the two words;
a process of calculating a score for each word by performing a predetermined calculation based on the tf·idf value and the idf/conf value for each word;
a process of calculating a score for each category based on the score for each word; and
a process of classifying the input document into one of the plurality of categories based on the score for each category.