JP5364529B2

JP5364529B2 - Dictionary registration device, document label determination system, and dictionary registration program

Info

Publication number: JP5364529B2
Application number: JP2009233756A
Authority: JP
Inventors: 正柳原; 一則松本; 康弘滝嶋
Original assignee: KDDI R&D Laboratories Inc
Current assignee: KDDI R&D Laboratories Inc
Priority date: 2009-10-07
Filing date: 2009-10-07
Publication date: 2013-12-11
Anticipated expiration: 2029-10-07
Also published as: JP2011081626A

Abstract

PROBLEM TO BE SOLVED: To detect, from among the words appearing in an electronic document, a word or a combination of words that characteristically expresses the content and is related to a specific property, while taking into account relations to other words appearing in the electronic document.

SOLUTION: A dictionary registration device includes: an electronic document storage section 110 for storing a set of correct documents related to a specific property and a set of incorrect documents not related to the property; an input section 176 to which a first word and a second word, which are dictionary registration candidates for determining documents related to the property, are input; a document number calculation section 175 for calculating first to eighth document counts for each combination of the first and second words; an information criterion value calculation section 177 for calculating first to third information criterion values based on the first to eighth document counts; and a registration word selection section 178 for comparing the first to third information criterion values and selecting, as a registration word, whichever of the first word, the second word, and the combination of the first word and the second word has the highest degree of relation to the property.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a dictionary registration device, a document label determination system, and a dictionary registration program that determine the optimum words for judging whether the content of text information contained in an electronic document corresponds to a given label.

Conventionally, document label determination systems have been used that take text-based web content such as blogs, or electronic documents such as files generated by word-processing software, determine what kind of property the text information in the electronic document has, and classify the electronic document by assigning a label according to that content. Labels include, for example, labels indicating the topic of an electronic document, such as sports or economy. When determining whether an electronic document subject to label determination corresponds to a given label, dictionary data is used in which a plurality of index words highly relevant to the label are associated with that label. For example, when the label is "economy", dictionary data in which words such as "Ministry of Finance" and "foreign exchange" are associated as index words is stored in advance. The document label determination system detects, in the electronic document subject to label determination, words that match index words in the dictionary data, and determines whether the electronic document corresponds to the label according to the degree of matching.

Patent Document 1 discloses a technique for summarizing an electronic document by evaluating the words that appear in it and detecting characteristic words that indicate its content. In that technique, words appearing in the electronic document are scored according to combinations of a plurality of words appearing in the document, and the words are evaluated according to the score, so that a highly reliable word importance is calculated.
Patent Document 2 proposes a technique for calculating word importance by performing a model test based on an information criterion. In that technique, a word is selected as important when the value obtained by subtracting the score calculated with the dependent model from the score calculated with the independent model is greater than 0.
Non-Patent Document 1 proposes a technique that performs a model test based on an information criterion and selects only the words that are important for determining whether a document corresponds to a topic.

Patent Document 1: JP 2005-141428 A
Patent Document 2: JP 2005-284209 A

Non-Patent Document 1: Kazunori Matsumoto, Kazuo Hashimoto, "Schema Design for Causal Law Mining from Incomplete Database", Discovery Science, Second International Conference, DS '99, Tokyo, Japan, December 1999, Proceedings. Lecture Notes in Computer Science 1721, Springer, pp. 92-102, 1999.

However, the dictionary data referred to when determining whether an electronic document corresponds to a specific label is in some cases dictionary data created arbitrarily by a user. In such dictionary data, the optimum index words are not necessarily associated with the label. In addition, because the index words are fixed, it is difficult to change them flexibly in response to the content of electronic documents, which shifts with current events.
One conceivable approach is therefore to analyze, after the fact, the words contained in a plurality of electronic documents that have been determined to correspond to a specific label, and to update the dictionary data recursively according to those words. For example, a candidate word could be scored according to the proportion of electronic documents in which it appears relative to the proportion for other words, and adopted as an index word or not according to that score. This method, however, scores each word according to the number of times it appears independently of the others. Because the meaning and importance of a word in an electronic document can differ depending on its correspondence and relation to other words, the method cannot always detect index words accurately. To raise the accuracy of index word detection, the technique of Patent Document 1 could be applied so that dictionary data is generated by scoring words according to combinations of a plurality of words. In that case, however, the number of combinations grows explosively as the number of index word candidates increases, and the amount of computation becomes large.

The technique disclosed in Patent Document 1 uses the chi-squared (χ²) test to calculate scores, but analysis with the chi-squared test requires parameters to be adjusted according to the nature of the data being analyzed. The technique disclosed in Patent Document 2 cannot detect words that are important for determining whether a document corresponds to a topic. Furthermore, like Patent Document 1, Patent Document 2 has the problem that the number of combinations increases explosively when the importance of combinations of a plurality of words is calculated.

In addition, when the technique of Non-Patent Document 1 described above is used to calculate, for example, the score of a single word and the score of a combination of a plurality of words, the two scores are on different scales. The accuracy of an index word consisting of a single word therefore cannot be compared with that of an index word consisting of a combination of words.

The present invention has been made in view of these circumstances. It provides a dictionary registration device, a document label determination system, and a dictionary registration program capable of detecting, from among the words appearing in an electronic document, a word or a combination of words that characteristically expresses the content and is related to a specific property, while taking into account the relation to other words appearing in the electronic document.

To solve the above problems, a dictionary registration device according to the present invention comprises: an electronic document storage unit that stores a set of correct documents related to a specific property and a set of incorrect documents not related to the property; an input unit to which a first word and a second word, which are dictionary registration candidates for determining documents related to the property, are input; a document number calculation unit that calculates, for each combination of the first word and the second word, a first document count, which is the number of correct documents containing both the first word and the second word, a second document count, which is the number of correct documents containing the first word but not the second word, a third document count, which is the number of correct documents containing the second word but not the first word, a fourth document count, which is the number of correct documents containing neither the first word nor the second word, a fifth document count, which is the number of incorrect documents containing both the first word and the second word, a sixth document count, which is the number of incorrect documents containing the first word but not the second word, a seventh document count, which is the number of incorrect documents containing the second word but not the first word, and an eighth document count, which is the number of incorrect documents containing neither the first word nor the second word; an information criterion value calculation unit that calculates, based on the first to eighth document counts calculated by the document number calculation unit, a first information criterion value indicating the degree of relation between the first word and the property, a second information criterion value indicating the degree of relation between the second word and the property, and a third information criterion value indicating the degree of relation between the combination of the first word and the second word and the property; and a registration word selection unit that compares the information criterion values calculated by the information criterion value calculation unit and selects, as a registration word, whichever of the first word, the second word, and the combination of the first word and the second word has the highest degree of relation to the property.
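The eight document counts described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each document has already been reduced to a set of words, and the function name `count_documents` and the sample documents are hypothetical.

```python
from typing import Iterable, Set, Tuple


def count_documents(
    correct_docs: Iterable[Set[str]],
    incorrect_docs: Iterable[Set[str]],
    first_word: str,
    second_word: str,
) -> Tuple[int, ...]:
    """Return the first to eighth document counts as one tuple.

    Order: (both, first only, second only, neither) for the correct
    documents, followed by the same four counts for the incorrect ones.
    """

    def tally(docs: Iterable[Set[str]]) -> Tuple[int, int, int, int]:
        both = only_first = only_second = neither = 0
        for words in docs:
            has_first = first_word in words
            has_second = second_word in words
            if has_first and has_second:
                both += 1
            elif has_first:
                only_first += 1
            elif has_second:
                only_second += 1
            else:
                neither += 1
        return both, only_first, only_second, neither

    return tally(correct_docs) + tally(incorrect_docs)


# Hypothetical sample data: each document is a set of its words.
correct = [{"為替", "ドル"}, {"為替"}, {"ドル", "相場"}]
incorrect = [{"野球"}, {"ドル", "野球"}]
counts = count_documents(correct, incorrect, "為替", "ドル")
```

Representing each document as a set means a word is counted once per document regardless of how often it occurs, which matches the claim's wording of documents "containing" a word.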

In the dictionary registration device according to the present invention, the first information criterion value AIC(DM1), which indicates the degree of relation between the first word and the property, is calculated by the following formulas:

$$
\begin{aligned}
\mathrm{MLL} ={}& (N_{11}+N_{12})\log(N_{11}+N_{12}) + (N_{13}+N_{14})\log(N_{13}+N_{14}) \\
&+ (N_{21}+N_{22})\log(N_{21}+N_{22}) + (N_{23}+N_{24})\log(N_{23}+N_{24}) \\
&+ (N_{11}+N_{13}+N_{21}+N_{23})\log(N_{11}+N_{13}+N_{21}+N_{23}) \\
&+ (N_{12}+N_{14}+N_{22}+N_{24})\log(N_{12}+N_{14}+N_{22}+N_{24}) - 2 \times Z\log Z, \\
\mathrm{AIC}(\mathrm{DM1}) ={}& -2 \times \mathrm{MLL} + 2 \times 4,
\end{aligned}
$$

where the base of log, which is 10, is omitted from the notation, $Z = N_{11}+N_{12}+N_{13}+N_{14}+N_{21}+N_{22}+N_{23}+N_{24}$, and $N_{11}$ is the first document count, $N_{12}$ the second document count, $N_{13}$ the third document count, $N_{14}$ the fourth document count, $N_{21}$ the fifth document count, $N_{22}$ the sixth document count, $N_{23}$ the seventh document count, and $N_{24}$ the eighth document count.

In the dictionary registration device according to the present invention, the second information criterion value AIC(DM2), which indicates the degree of relation between the second word and the property, is calculated by the following formulas:

$$
\begin{aligned}
\mathrm{MLL} ={}& (N_{11}+N_{13})\log(N_{11}+N_{13}) + (N_{12}+N_{14})\log(N_{12}+N_{14}) \\
&+ (N_{21}+N_{23})\log(N_{21}+N_{23}) + (N_{22}+N_{24})\log(N_{22}+N_{24}) \\
&+ (N_{11}+N_{12}+N_{21}+N_{22})\log(N_{11}+N_{12}+N_{21}+N_{22}) \\
&+ (N_{13}+N_{14}+N_{23}+N_{24})\log(N_{13}+N_{14}+N_{23}+N_{24}) - 2 \times Z\log Z, \\
\mathrm{AIC}(\mathrm{DM2}) ={}& -2 \times \mathrm{MLL} + 2 \times 4,
\end{aligned}
$$

where the base of log, which is 10, is omitted from the notation, $Z = N_{11}+N_{12}+N_{13}+N_{14}+N_{21}+N_{22}+N_{23}+N_{24}$, and $N_{11}$ is the first document count, $N_{12}$ the second document count, $N_{13}$ the third document count, $N_{14}$ the fourth document count, $N_{21}$ the fifth document count, $N_{22}$ the sixth document count, $N_{23}$ the seventh document count, and $N_{24}$ the eighth document count.

In the dictionary registration device according to the present invention, the third information criterion value AIC(DM12), which indicates the degree of relation between the combination of the first word and the second word and the property, is calculated by the following formulas:

$$
\begin{aligned}
\mathrm{MLL} ={}& N_{11}\log N_{11} + N_{12}\log N_{12} + N_{13}\log N_{13} + N_{14}\log N_{14} \\
&+ N_{21}\log N_{21} + N_{22}\log N_{22} + N_{23}\log N_{23} + N_{24}\log N_{24} - Z\log Z, \\
\mathrm{AIC}(\mathrm{DM12}) ={}& -2 \times \mathrm{MLL} + 2 \times 7,
\end{aligned}
$$

where the base of log, which is 10, is omitted from the notation, $Z = N_{11}+N_{12}+N_{13}+N_{14}+N_{21}+N_{22}+N_{23}+N_{24}$, and $N_{11}$ is the first document count, $N_{12}$ the second document count, $N_{13}$ the third document count, $N_{14}$ the fourth document count, $N_{21}$ the fifth document count, $N_{22}$ the sixth document count, $N_{23}$ the seventh document count, and $N_{24}$ the eighth document count.
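The three formulas can be evaluated together as in the sketch below, which takes the eight document counts in the order N11, N12, N13, N14, N21, N22, N23, N24. Two points are assumptions rather than statements from the text: a term of the form 0 log 0 is treated as 0, and the candidate with the smallest AIC value is taken to be the one with the highest degree of relation, following the usual reading of AIC. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def xlogx(n: float) -> float:
    """n * log10(n), treating 0 * log 0 as 0 (an assumption of this sketch)."""
    return n * math.log10(n) if n > 0 else 0.0


def aic_values(counts: Sequence[int]) -> Dict[str, float]:
    """Compute AIC(DM1), AIC(DM2) and AIC(DM12) from the eight counts."""
    n11, n12, n13, n14, n21, n22, n23, n24 = counts
    z = sum(counts)
    mll_dm1 = (
        xlogx(n11 + n12) + xlogx(n13 + n14)
        + xlogx(n21 + n22) + xlogx(n23 + n24)
        + xlogx(n11 + n13 + n21 + n23)
        + xlogx(n12 + n14 + n22 + n24)
        - 2 * xlogx(z)
    )
    mll_dm2 = (
        xlogx(n11 + n13) + xlogx(n12 + n14)
        + xlogx(n21 + n23) + xlogx(n22 + n24)
        + xlogx(n11 + n12 + n21 + n22)
        + xlogx(n13 + n14 + n23 + n24)
        - 2 * xlogx(z)
    )
    mll_dm12 = sum(xlogx(n) for n in counts) - xlogx(z)
    return {
        "DM1": -2 * mll_dm1 + 2 * 4,
        "DM2": -2 * mll_dm2 + 2 * 4,
        "DM12": -2 * mll_dm12 + 2 * 7,
    }


def select_registration_word(
    counts: Sequence[int], first_word: str, second_word: str
) -> Tuple[str, ...]:
    """Pick the candidate whose model has the smallest AIC (assumed rule)."""
    aic = aic_values(counts)
    best = min(aic, key=aic.get)
    candidates = {
        "DM1": (first_word,),
        "DM2": (second_word,),
        "DM12": (first_word, second_word),
    }
    return candidates[best]


# Hypothetical counts: the first word appears in every correct document and in
# no incorrect document, while the second word appears in half of each set.
example_counts = (5, 5, 0, 0, 0, 0, 5, 5)
example_aic = aic_values(example_counts)
chosen = select_registration_word(example_counts, "為替", "ドル")
```

Because all three values are computed from the same eight counts with the same form of penalty term, a single word and a word pair can be compared directly, which is the property the text relies on.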

The dictionary registration device according to the present invention is characterized by including a normalization processing unit that performs word normalization processing.
The dictionary registration device according to the present invention is characterized by including a morphological analysis unit that extracts words from a document.

A document label determination system according to the present invention is characterized by comprising: any one of the dictionary registration devices described above; a dictionary database in which registration words selected by the dictionary registration device are stored in association with a label representing a specific property; and a label determination device that determines the label corresponding to an input document using the dictionary database.

A dictionary registration program according to the present invention is a computer program that causes a computer, which has an electronic document storage unit storing a set of correct documents related to a specific property and a set of incorrect documents not related to the property, and an input unit to which a first word and a second word, which are dictionary registration candidates for determining documents related to the property, are input, to execute: a step of calculating, for each combination of the first word and the second word, a first document count, which is the number of correct documents containing both the first word and the second word, a second document count, which is the number of correct documents containing the first word but not the second word, a third document count, which is the number of correct documents containing the second word but not the first word, a fourth document count, which is the number of correct documents containing neither the first word nor the second word, a fifth document count, which is the number of incorrect documents containing both the first word and the second word, a sixth document count, which is the number of incorrect documents containing the first word but not the second word, a seventh document count, which is the number of incorrect documents containing the second word but not the first word, and an eighth document count, which is the number of incorrect documents containing neither the first word nor the second word; a step of calculating, based on the calculated first to eighth document counts, a first information criterion value indicating the degree of relation between the first word and the property, a second information criterion value indicating the degree of relation between the second word and the property, and a third information criterion value indicating the degree of relation between the combination of the first word and the second word and the property; and a step of comparing the calculated information criterion values and selecting, as a registration word, whichever of the first word, the second word, and the combination of the first word and the second word has the highest degree of relation to the property.
This allows the dictionary registration device described above to be realized using a computer.

According to the present invention, for a combination of a first word and a second word that are dictionary registration candidates for determining documents related to a specific property, the first word, the second word, and the combination of the first word and the second word can be evaluated on the same scale, and the one with the highest degree of relation to the property can be registered in the dictionary database. A dictionary database in which single words and word combinations are mixed can thereby be generated. As a result, the invention achieves the particular effect that, from among the words appearing in an electronic document, a word or a combination of words that characteristically expresses the content and is related to a specific property can be detected while taking into account the relation to other words appearing in the electronic document.

A diagram showing the system configuration of a document label determination system according to an embodiment of the present invention.
A diagram showing the concept of a 2×2 contingency table according to the embodiment.
A diagram showing the concept of a 2×4 contingency table according to the embodiment.
A flowchart showing an operation example of the document label determination system according to the embodiment.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a diagram showing the system configuration of the document label determination system according to the present embodiment. In FIG. 1, the document label determination system includes a dictionary database 300, a label determination device 200, and a dictionary registration device 100.

The dictionary database 300 is a storage device that stores dictionary data. Dictionary data is information in which a label, such as a predetermined word, is associated with a plurality of index words that are highly relevant to that label. For example, if the label is "economy", words such as "Ministry of Finance", "foreign exchange", "market rate", "trend", and "dollar" are associated with it as index words. The dictionary data may include a plurality of index words associated with each of several labels indicating topics, such as "politics" and "sports". It may also include dictionary data in which index words are associated with labels for hierarchically organized topics, such as "soccer" and "baseball" within "sports". The dictionary data may further include, for example, a "harmful" label associated with index words considered inappropriate for persons under 18 years of age. The dictionary data stored in the dictionary database 300 is read when the label determination device 200 performs label determination processing on an electronic document. The dictionary database 300 may be implemented as an independent computer device or as a database application installed on the label determination device 200.

The label determination device 200 is a computer device that reads the dictionary data stored in the dictionary database 300, compares and analyzes the dictionary data against an input electronic document, and determines the label corresponding to the electronic document. The input electronic document is text data, for example text-based web content such as a blog, or a document file generated by word-processing software. The label determination device 200 accepts an electronic document subject to label determination and, for each label in the dictionary data read from the dictionary database 300, determines whether the electronic document contains words that match the index words associated with that label, thereby performing label determination processing that decides whether the electronic document corresponds to the label. For example, when the label determination device 200 detects, in the text data of the electronic document under determination, more words corresponding to a label in the dictionary data than a predetermined threshold, it determines that label to be the label of the electronic document. The label determination device 200 may also determine, label by label, whether a single electronic document corresponds to each of a plurality of labels, and assign all applicable labels to that document.
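The threshold-based label determination described above can be sketched as follows. This is only an illustration: the text does not specify whether matches are counted per occurrence or per distinct word, so counting every occurrence is an assumption here, and the function name, the dictionary layout, and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def determine_labels(
    document_words: List[str],
    dictionary: Dict[str, Set[str]],
    threshold: int,
) -> List[str]:
    """Return every label whose index-word matches exceed the threshold."""
    labels = []
    for label, index_words in dictionary.items():
        # Count each occurrence of an index word (assumed counting rule).
        hits = sum(1 for word in document_words if word in index_words)
        if hits > threshold:
            labels.append(label)
    return labels


# Hypothetical dictionary data: label -> index words.
dictionary_data = {
    "economy": {"財務省", "為替", "ドル"},
    "sports": {"サッカー", "野球"},
}
words_in_document = ["為替", "ドル", "相場", "野球"]
assigned = determine_labels(words_in_document, dictionary_data, threshold=1)
```

Returning a list rather than a single label reflects the option, mentioned in the text, of assigning several labels to one document.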

The dictionary registration device 100 is a computer device that, based on the electronic documents for which the label determination device 200 has performed label determination and on their labels, recursively calculates the optimum index words for each label and updates the dictionary data stored in the dictionary database 300. That is, even if the dictionary data stored in the dictionary database 300 initially associates each label with a group of index words chosen arbitrarily, for example by a user, the optimum index words are detected recursively from the electronic documents that the dictionary data has determined to correspond to the label, and new dictionary data is generated from them. This makes it possible to generate dictionary data for determining the optimum label that follows changes in content, for example on blog sites and news sites on the web, where the important words shift with current events.

The dictionary registration device 100 includes an electronic document storage unit 110, a normalization processing unit 120, a morphological analysis unit 130, a morphological analysis dictionary storage unit 140, a word distribution calculation unit 150, a word distribution table storage unit 160, an index word score calculation unit 170, a document number calculation unit 175, an input unit 176, an information criterion value calculation unit 177, a registration word selection unit 178, and a dictionary registration unit 180.

The electronic document storage unit 110 stores a set of correct documents related to a specific property and a set of incorrect documents not related to the property. Specifically, the electronic document storage unit 110 stores each electronic document for which the label determination device 200 has performed label determination, in association with a label determination result indicating whether the electronic document was determined to correspond to a specific label. When the electronic document is blog data, the stored electronic document includes the text body of the blog article, pictographs (emoji), HTML (HyperText Markup Language) tags, and the like, but does not include image data.

The normalization processing unit 120 takes as input the label-determined electronic documents stored in the electronic document storage unit 110 together with their label determination results, normalizes the sentences and words contained in each electronic document, and outputs a normalized electronic document. The normalization performed by the normalization processing unit 120 includes, for example, the following steps. First, similar symbols such as the hyphen, the minus sign, and the long vowel mark are normalized according to a predetermined rule set; here, for example, all of them are converted to hyphens. All half-width characters are converted to full-width characters. All tab characters are converted to blank characters. Pictographs are converted to a specific character code (for example, 0xA2A2). When the electronic document is web data such as a blog article, HTML tags are removed from the web data. Small Japanese kana are converted to their full-size forms; for example, the small kana 「ィ」 is converted to the full-size 「イ」. However, for a word that is stored in the morphological analysis dictionary storage unit 140 (described later) in a form containing small kana, this conversion is not performed. At this stage, lowercase English letters are left as lowercase and are not converted to uppercase.
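The normalization steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `normalize`, the particular hyphen-like code points, the subset of small kana, and the naive tag pattern are all hypothetical choices. Pictograph replacement, half-width katakana, and the dictionary-based exception for small kana are omitted.

```python
import re

# Assumed rule set: these hyphen-like symbols all become an ASCII hyphen.
HYPHEN_LIKE = str.maketrans({"\u2010": "-", "\u2212": "-", "\u30fc": "-"})
# Small kana mapped to their full-size forms (an illustrative subset only).
SMALL_KANA = str.maketrans("ァィゥェォ", "アイウエオ")
# Half-width printable ASCII (0x21-0x7E) shifted to the full-width block.
HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
# Naive tag pattern; real-world HTML would need a proper parser.
TAG = re.compile(r"<[^>]*>")


def normalize(text: str) -> str:
    """Apply the normalization steps in a fixed, assumed order."""
    text = TAG.sub("", text)             # remove HTML tags
    text = text.replace("\t", " ")       # tabs become blanks
    text = text.translate(HYPHEN_LIKE)   # unify hyphen, minus, long vowel mark
    text = text.translate(SMALL_KANA)    # small kana to full-size kana
    return text.translate(HALF_TO_FULL)  # half-width ASCII to full-width
```

Because the half-width to full-width step runs last in this sketch, the unified hyphen itself ends up as the full-width hyphen; a different step order would be an equally plausible reading of the text.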

The morphological analysis unit 130 takes as input the normalized electronic document output by the normalization processing unit 120, the label determination result for that document, and the morphological analysis dictionary read from the morphological analysis dictionary storage unit 140, performs morphological analysis on the normalized electronic document, and outputs a document vector table. Here, the document vector table is data in which the text is divided into morphemes (the smallest meaningful units) and each morpheme is associated with its part-of-speech information. For example, when an electronic document contains the text 「私の名前は中村です」 ("My name is Nakamura"), morphological analysis divides it into 「私」, 「の」, 「名前」, 「は」, 「中村」, 「です」, and so on, and determines the part of speech of each.

The word distribution calculation unit 150 generates and outputs a word distribution table based on the document vector table generated by the morphological analysis unit 130. The word distribution table is built from a word list obtained by removing specific parts of speech, such as particles and auxiliary verbs, from the document vector table and extracting only the words, such as nouns, that are to be treated as index words; each word in the list is associated with a count indicating its frequency of appearance in the electronic documents. The word distribution calculation unit 150 also normalizes the words extracted from the electronic documents. For example, it normalizes English words by converting lowercase letters to uppercase. It also corrects spelling variations of katakana words; for example, the word 「タイヤモンド」 is converted to the character data 「ダイヤモンド」 ("diamond"). Performing word-level normalization after morphological analysis in this way makes it possible, for a notation such as 「西日本」, to detect accurately whether it is the single word 「西日本」 ("West Japan") or a phrase composed of the two words 「西日」 and 「本」.

Here, the word distribution calculation unit 150 generates a single word distribution table for a plurality of electronic documents; if a normalized character string is not yet in the word distribution table, that normalized word is newly added to it. Two methods can be used to compute the count associated with each normalized word: a method that adds to the count according to the number of times the word appears within the same electronic document (tf: term frequency), and a method that counts only whether the word appears in an electronic document, regardless of how many times it appears there (df: document frequency). The present embodiment applies the df method, which computes the count from whether or not the word appears in each of the plurality of electronic documents. The word distribution calculation unit 150 stores the generated word distribution table in the word distribution table storage unit 160.
The word distribution table storage unit 160 stores the word distribution table generated by the word distribution calculation unit 150. As described above, the word distribution table is a data table in which each word that is an index word candidate is associated with a count indicating its frequency of appearance in the electronic documents.
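The difference between the tf and df counting methods can be illustrated with a short sketch. It assumes each document has already been reduced to a list of normalized index-word candidates; the function names `document_frequency` and `term_frequency` are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """df: count each word once per document in which it appears."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # duplicates within one document are ignored
    return df


def term_frequency(docs):
    """tf: count every occurrence of each word across all documents."""
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)
    return tf
```

For two documents `["a", "a", "b"]` and `["a"]`, the df count of "a" is 2 while its tf count is 3; the embodiment described here uses the df count.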

The index word score calculation unit 170 calculates a score for each word in the word distribution table, based on the label determination results stored in the electronic document storage unit 110 and the word distribution table stored in the word distribution table storage unit 160. The index word score calculation unit 170 includes an SSS calculation unit 171, an MSS calculation unit 172, an MDS calculation unit 173, and an SDS calculation unit 174.

The SSS calculation unit 171 reads the label determination results stored in the electronic document storage unit 110 and the word distribution table stored in the word distribution table storage unit 160, generates a 2 × 2 contingency table describing whether or not a specific word w is contained in the electronic documents, and calculates the single static score word list SSS(W) based on the generated 2 × 2 contingency table.

FIG. 2 is a diagram showing the concept of the 2 × 2 contingency table generated by the SSS calculation unit 171. Let N_ALL be the number of all electronic documents subjected to label determination by the label determination device 200 for a specific label (hereinafter referred to as the target label), N_OK the number of those documents determined to correspond to the target label, and N_NG the number determined not to correspond to it. Let the word set W be the set of all words in the word distribution table, and let w denote each word in W. Then a is the number of the N_OK documents determined to correspond to the target label that contain the word w, and c is the number of the N_NG documents determined not to correspond to the target label that contain the word w. Likewise, b is the number of the N_OK documents that do not contain the word w, and d is the number of the N_NG documents that do not contain the word w.

At this time, the following equations hold.
・a + c = df(w) (the number of documents among N_ALL that contain the word w)
・b + d = N_ALL − df(w)
・a + b = N_OK
・c + d = N_NG
In the following description, a + c is denoted by q, a + b by r, and a + c + b + d by z.
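As a rough illustration of how the four cells a, b, c, and d defined for FIG. 2 might be counted, the following sketch assumes each document is represented as a set of its index words and that the documents are already split by label determination result. The function name `contingency_2x2` and this data layout are assumptions made for illustration.

```python
def contingency_2x2(word, ok_docs, ng_docs):
    """Return (a, b, c, d) for one word.

    ok_docs: documents determined to correspond to the target label.
    ng_docs: documents determined not to correspond to it.
    Each document is a set of words.
    """
    a = sum(word in doc for doc in ok_docs)  # label OK, contains the word
    c = sum(word in doc for doc in ng_docs)  # label NG, contains the word
    b = len(ok_docs) - a                     # label OK, word absent
    d = len(ng_docs) - c                     # label NG, word absent
    return a, b, c, d
```

Deriving b and d by subtraction keeps the row totals equal to the number of documents in each label group by construction.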

Based on the generated 2 × 2 contingency table, the SSS calculation unit 171 obtains, by equation (1) below, the log-likelihood value MLL_1 under the assumption that a causal relationship exists, and calculates the AIC(DM) value, which is an information amount reference amount. Hereinafter, the base of log, 10, is omitted from the notation.

MLL_1 = a log a + c log c + b log b + d log d − z log z
AIC(DM) = −2 × MLL_1 + 2 × 3
(where 0 log 0 = 0)
... (1)

Further, by equation (2) below, the log-likelihood value MLL_2 under the assumption that no causal relationship exists is obtained, and the AIC(IM) value, which is an information amount reference amount, is calculated.

MLL_2 = q log q + r log r + (z − q) log (z − q) + (z − r) log (z − r) − 2 z log z
AIC(IM) = −2 × MLL_2 + 2 × 2
(where 0 log 0 = 0)
... (2)

Then, based on the AIC(IM) value and the AIC(DM) value calculated by equations (1) and (2) above, the word importance E(w) is calculated by equation (3) or equation (4) below.

When a / (a + c) > b / (b + d):
E(w) = AIC(IM) − AIC(DM)
... (3)

When a / (a + c) < b / (b + d):
E(w) = AIC(DM) − AIC(IM)
... (4)
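Equations (1) through (4) can be sketched together as follows. This is an illustration under assumptions rather than the patent's implementation: the names `xlogx` and `word_importance` are hypothetical, the comparison of the two ratios is done by cross-multiplication (a × d against b × c) to avoid dividing by zero, and the tie case, which the text does not specify, is arbitrarily handled with equation (3).

```python
import math


def xlogx(x):
    """x * log10(x), with the convention 0 log 0 = 0."""
    return 0.0 if x == 0 else x * math.log10(x)


def word_importance(a, b, c, d):
    """Word importance E(w) from a 2 x 2 contingency table."""
    z = a + b + c + d
    q = a + c
    r = a + b
    # Equation (1): dependence model (causal relationship assumed).
    mll1 = xlogx(a) + xlogx(c) + xlogx(b) + xlogx(d) - xlogx(z)
    aic_dm = -2 * mll1 + 2 * 3
    # Equation (2): independence model (no causal relationship assumed).
    mll2 = (xlogx(q) + xlogx(r) + xlogx(z - q) + xlogx(z - r)
            - 2 * xlogx(z))
    aic_im = -2 * mll2 + 2 * 2
    # a/(a+c) < b/(b+d) is equivalent to a*d < b*c for positive denominators.
    if a * d < b * c:
        return aic_dm - aic_im  # equation (4)
    return aic_im - aic_dm      # equation (3); ties land here by assumption
```

With this sketch, a word that appears only in documents matching the label gets a positive score, a word that appears only in non-matching documents gets a negative score, and a word distributed identically across both groups scores −2, the difference in the penalty terms alone.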

After the word importance E(w) has been calculated for every word w in the word set W, a word list SSS(W) is generated in which the words are sorted in descending order of E(w). The words in SSS(W) are then ordered w_1, w_2, ..., w_NALL, and the single static score sss(w_i) of the i-th word w_i is E(w_i). In this way, the word list SSS(W), in which the words are arranged in descending order of sss(w_i), is generated.

The SDS calculation unit 174 takes as input the word list SSS(W) calculated by the SSS calculation unit 171 and the 2 × 2 contingency table for each word w_i in the word set W, and calculates the word list SDS(W) of single dynamic scores sds(w_i) for each word w_i in W. Here, SSS(W), the word list in which the SSS calculation unit 171 has arranged the single static scores sss(w_i) in descending order, is taken as a set C (C = {w_1, w_2, ..., w_NALL}). Let L be the set in which the words w_i of W are arranged in descending order of sds(w_i). In the initial state, L = {} (the empty set).

The SDS calculation unit 174 finds the word w_i in the set C for which sss(w_i) is largest. It then removes w_i from C (C = C − {w_i}) and takes sss(w_i) as the provisional sds(w_i) (sds(w_i) = sss(w_i)). Here, let n_11(ij) be the number of the N_OK documents determined to correspond to the target label that contain the word w_i and another arbitrary word w_j, and let n_12(ij) be the number of the N_NG documents determined not to correspond to the target label that contain the word w_i and the word w_j. Likewise, let n_21(ij) be the number of the N_OK documents that do not contain the word w_i and the word w_j, and let n_22(ij) be the number of the N_NG documents that do not contain the word w_i and the word w_j. Then, for each other word w_j in C, the values a, b, c, and d of the 2 × 2 contingency table generated by the SSS calculation unit 171 are updated as follows.

・a = a − n_11(ij)
・c = c − n_12(ij)
・b = b − n_21(ij)
・d = d − n_22(ij)

Then, from the 2 × 2 contingency table for the word w_j, the word importance E(w_j) is calculated in accordance with equations (1), (2), (3), and (4) above. The SDS calculation unit 174 finds, among the words in C, the word w_j whose word importance E(w_j) is largest, and adds it to the set L with E(w_j) as sds(w_j) (L = L + {w_j}).
The SDS calculation unit 174 repeats the processing from finding the w_i in C with the largest sss(w_i) through adding the w_j with the largest sds(w_j) to the set L, until the set C becomes empty. This yields the word list SDS(W), in which the words are arranged in descending order of sds(w). In SDS(W), every word w is ranked by its score with the influence of the words ranked above it removed.

The MSS calculation unit 172 takes as input SSS(W) calculated by the SSS calculation unit 171 and the 2 × 2 contingency table for each word w_i in the word set W, and calculates the word list MSS(W) of multi-static scores for each word w_i in W. Here, let G be the set of combinations ω of arbitrary words in the word set W. In the initial state, G = {} (the empty set). Let u be a variable indicating the number of combinations ω added to G. In the initial state, u = 0.

The MSS calculation unit 172 finds the word w_i in the word set W with the largest sss(w_i). It also finds, apart from w_i, the m words w_1 to w_m with the largest single static scores (m is an arbitrary predetermined number). It then calculates a 2 × 2 contingency table for each combination ω_j of the word w_i and a word w_j (1 ≤ j ≤ m). Here, a(ω_j) is the number of the N_OK documents determined to correspond to the target label that contain the combination ω_j of the word w_i and the word w_j, and c(ω_j) is the number of the N_NG documents determined not to correspond to the target label that contain the combination ω_j. Likewise, b(ω_j) is the number of the N_OK documents that do not contain the combination ω_j, and d(ω_j) is the number of the N_NG documents that do not contain the combination ω_j.

At this time, the following equations hold.
・a(ω_j) + c(ω_j) = df(ω_j) (the number of documents among N_ALL that contain the combination ω_j)
・b(ω_j) + d(ω_j) = N_ALL − df(ω_j)
・a(ω_j) + b(ω_j) = N_OK
・c(ω_j) + d(ω_j) = N_NG

Here, while the 2 × 2 contingency tables for the combinations ω_j are being calculated repeatedly for each word w_i, a table that has already been calculated for a given combination ω_j of w_i and w_j is not recalculated. Avoiding duplicate calculation of the 2 × 2 contingency table for the same combination reduces the amount of computation. The importance E(ω_j), calculated in accordance with equations (1), (2), (3), and (4) above, is taken as MSS(ω_j), and among the combinations ω_j of the word w_i and the words w_j (1 ≤ j ≤ m), the ω_j with the largest MSS(ω_j) is found. This ω_j is added to the combination set G (G = G + {ω_j}), and the variable u is incremented (u = u + 1). Next, among the words w_k in the word set W other than w_i, the word w_k with the largest sss(w_k) is taken as the new word w_i whose importance is to be determined, and the processing from finding the m words w_1 to w_m with the largest single static scores sss(w_j) other than w_i through finding the word combination with the largest MSS(ω_j) and adding it to the combination set G is repeated. This yields the word list MSS(W), in which mss(w) is arranged in descending order of score.
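The pair-wise 2 × 2 contingency table and the rule of not recalculating an already-computed combination can be sketched as below. The sketch assumes that a document "contains the combination" when both words appear in it, which is one reading of the text, and that documents are sets of words. The function name `pair_table` and the dictionary-based cache are hypothetical.

```python
def pair_table(w_i, w_j, ok_docs, ng_docs, cache):
    """Return (a, b, c, d) for the combination of w_i and w_j.

    The cache is keyed by an unordered pair, so (w_i, w_j) and
    (w_j, w_i) share one entry and are computed only once.
    """
    key = frozenset((w_i, w_j))
    if key not in cache:
        a = sum(w_i in doc and w_j in doc for doc in ok_docs)
        c = sum(w_i in doc and w_j in doc for doc in ng_docs)
        cache[key] = (a, len(ok_docs) - a, c, len(ng_docs) - c)
    return cache[key]
```

Keying the cache on a `frozenset` is a design choice that makes the lookup independent of the order in which the two words are visited during the repeated processing.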

The MDS calculation unit 173 takes as input the word list MSS(W) calculated by the MSS calculation unit 172 and the 2 × 2 contingency tables for the combinations ω_j of the words w_i and w_j in W, and calculates the word list MDS(W) of multi-dynamic scores mds(ω_i) for the combinations ω_i of each word w_i in the word set W with other words. Here, MSS(W), the word list in which the MSS calculation unit 172 has arranged the multi-static scores mss(ω_i) in descending order, is taken as a set C (C = {ω_1, ω_2, ..., ω_NALL}). Let L be the set in which the word combinations ω are arranged in descending order of mds(ω_i). In the initial state, L = {} (the empty set).

The MDS calculation unit 173 finds the combination ω_i in the set C for which mss(ω_i) is largest. It then removes ω_i from C (C = C − {ω_i}) and takes mss(ω_i) as the provisional mds(ω_i) (mds(ω_i) = mss(ω_i)). Here, let n_11(ij) be the number of the N_OK documents determined to correspond to the target label that contain the word combination ω_i and another arbitrary word combination ω_j, and let n_12(ij) be the number of the N_NG documents determined not to correspond to the target label that contain ω_i and ω_j. Likewise, let n_21(ij) be the number of the N_OK documents that do not contain ω_i and ω_j, and let n_22(ij) be the number of the N_NG documents that do not contain ω_i and ω_j. Then, for each other word combination ω_j in C, the values a(ω_j), b(ω_j), c(ω_j), and d(ω_j) of the 2 × 2 contingency table generated by the MSS calculation unit 172 are updated as follows.

・a(ω_j) = a(ω_j) − n_11(ij)
・c(ω_j) = c(ω_j) − n_12(ij)
・b(ω_j) = b(ω_j) − n_21(ij)
・d(ω_j) = d(ω_j) − n_22(ij)

Then, from the updated 2 × 2 contingency table for the word combination ω_j, the word importance E(ω_j) is calculated in accordance with equations (1), (2), (3), and (4) above. The MDS calculation unit 173 finds, among the word combinations in C, the combination ω_j whose word importance E(ω_j) is largest, and adds it to the set L with E(ω_j) as mds(ω_j) (L = L + {ω_j}).

The MDS calculation unit 173 repeats the processing from finding the ω_i in C with the largest mss(ω_i) through adding the ω_j with the largest mds(ω_j) to the set L, until the set C becomes empty. This yields the word list MDS(W), in which the combinations are arranged in descending order of mds(ω). In MDS(W), every word combination ω is ranked by its score with the influence of the words ranked above it removed.

The document number calculation unit 175 generates a 2 × 4 contingency table for each combination of a first word w1 and a second word w2, which are dictionary registration candidates for determining documents that correspond to the target label, and calculates the first to eighth document numbers N11, N12, N13, N14, N21, N22, N23, and N24.

FIG. 3 is a diagram showing the concept of the 2 × 4 contingency table generated by the document number calculation unit 175. In FIG. 3, N_ALL is the number of all electronic documents subjected to label determination for the target label, N_OK is the number of those documents determined to correspond to the target label (correct documents), and N_NG is the number determined not to correspond to the target label (incorrect documents). The first to eighth document numbers N11, N12, N13, N14, N21, N22, N23, and N24 are defined as follows.

The first document number N11 is the number of the N_OK correct documents that contain the first word w1 and contain the second word w2.
The second document number N12 is the number of the N_OK correct documents that contain the first word w1 and do not contain the second word w2.
The third document number N13 is the number of the N_OK correct documents that do not contain the first word w1 and contain the second word w2.
The fourth document number N14 is the number of the N_OK correct documents that do not contain the first word w1 and do not contain the second word w2.

The fifth document number N21 is the number of the N_NG incorrect documents that contain the first word w1 and contain the second word w2.
The sixth document number N22 is the number of the N_NG incorrect documents that contain the first word w1 and do not contain the second word w2.
The seventh document number N23 is the number of the N_NG incorrect documents that do not contain the first word w1 and contain the second word w2.
The eighth document number N24 is the number of the N_NG incorrect documents that do not contain the first word w1 and do not contain the second word w2.

At this time, the following relational expressions hold.
・N12 = a(w1) − N11, where a(w1) is the number of the N_OK correct documents that contain the first word w1
・N13 = a(w2) − N11, where a(w2) is the number of the N_OK correct documents that contain the second word w2
・N22 = c(w1) − N21, where c(w1) is the number of the N_NG incorrect documents that contain the first word w1
・N23 = c(w2) − N21, where c(w2) is the number of the N_NG incorrect documents that contain the second word w2
・Z = N11 + N12 + N13 + N14 + N21 + N22 + N23 + N24
Z is the number of all documents.

Here, a(w1) is the value a of the 2 × 2 contingency table for the first word w1 generated by the index word score calculation unit 170, and a(w2) is the value a of the 2 × 2 contingency table for the second word w2. Similarly, c(w1) is the value c of the 2 × 2 contingency table for the first word w1, and c(w2) is the value c of the 2 × 2 contingency table for the second word w2. The document number calculation unit 175 therefore obtains a and c of the 2 × 2 contingency table for the first word w1 and a and c of the 2 × 2 contingency table for the second word w2 from the index word score calculation unit 170, and calculates the document numbers N12, N13, N22, and N23 by the above relational expressions. This reduces the amount of computation.
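The reuse of the single-word tables to fill the 2 × 4 table can be sketched as follows. The function name `table_2x4` and its argument layout are hypothetical. The formulas for N14 and N24 are not stated in the text; here they are derived from the fact that the four cells in each row partition the N_OK and N_NG documents respectively.

```python
def table_2x4(n11, n21, a_w1, a_w2, c_w1, c_w2, n_ok, n_ng):
    """Fill the 2 x 4 table from co-occurrence counts and single-word counts.

    n11, n21: documents containing both words (correct / incorrect).
    a_w1, a_w2: value a of the 2 x 2 tables of w1 and w2.
    c_w1, c_w2: value c of the 2 x 2 tables of w1 and w2.
    """
    n12 = a_w1 - n11                 # w1 present, w2 absent (correct)
    n13 = a_w2 - n11                 # w1 absent, w2 present (correct)
    n14 = n_ok - n11 - n12 - n13     # derived: neither word (correct)
    n22 = c_w1 - n21                 # w1 present, w2 absent (incorrect)
    n23 = c_w2 - n21                 # w1 absent, w2 present (incorrect)
    n24 = n_ng - n21 - n22 - n23     # derived: neither word (incorrect)
    return (n11, n12, n13, n14), (n21, n22, n23, n24)
```

Only the two co-occurrence counts N11 and N21 require a pass over the documents in this sketch; the remaining six cells follow by subtraction.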

The input unit 176 inputs the first word w1 and the second word w2 to the document number calculation unit 175. The first word w1 and the second word w2 can be selected using the scores of the words for the target label calculated by the index word score calculation unit 170. For example, a predetermined number of important words for the target label are selected according to any one of the word lists calculated by the index word score calculation unit 170, namely the single static score word list SSS(W), the single dynamic score word list SDS(W), the multi-static score word list MSS(W), and the multi-dynamic score word list MDS(W); the first word w1 and the second word w2 are then selected in turn from the selected word group and input to the document number calculation unit 175.

The information amount reference amount calculation unit 177 calculates the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12) using the first to eighth document numbers N11, N12, N13, N14, N21, N22, N23, and N24 calculated by the document number calculation unit 175.

For a combination of the first word w1 and the second word w2, which are dictionary registration candidates for determining documents that correspond to the target label, the first information amount reference amount AIC(DM1) indicates the degree of the relationship between the target label and the first word w1. The first information amount reference amount AIC(DM1) is calculated by equation (5). In equations (5), (6), and (7) below, the base of log, 10, is omitted from the notation.

MLL = (N11 + N12) log (N11 + N12) + (N13 + N14) log (N13 + N14) + (N21 + N22) log (N21 + N22) + (N23 + N24) log (N23 + N24) + (N11 + N13 + N21 + N23) log (N11 + N13 + N21 + N23) + (N12 + N14 + N22 + N24) log (N12 + N14 + N22 + N24) − 2 × Z log Z
AIC(DM1) = −2 × MLL + 2 × 4
... (5)

For the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the first information amount reference amount AIC(DM1) can be used as a measure of whether only the first word w1 should be registered in the dictionary.

The second information amount reference amount AIC(DM2) indicates, for the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the degree of the relationship between the target label and the second word w2. It is calculated by equation (6).

$$
\begin{aligned}
\mathrm{MLL} ={} & (N_{11}+N_{13})\log(N_{11}+N_{13}) + (N_{12}+N_{14})\log(N_{12}+N_{14}) \\
& + (N_{21}+N_{23})\log(N_{21}+N_{23}) + (N_{22}+N_{24})\log(N_{22}+N_{24}) \\
& + (N_{11}+N_{12}+N_{21}+N_{22})\log(N_{11}+N_{12}+N_{21}+N_{22}) \\
& + (N_{13}+N_{14}+N_{23}+N_{24})\log(N_{13}+N_{14}+N_{23}+N_{24}) - 2 \times Z\log Z \\
\mathrm{AIC}(\mathrm{DM2}) ={} & -2 \times \mathrm{MLL} + 2 \times 4
\end{aligned}
\tag{6}
$$

For the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the second information amount reference amount AIC(DM2) can be used as a measure of whether only the second word w2 should be registered in the dictionary.

The third information amount reference amount AIC(DM12) indicates, for the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the degree of the relationship between the target label and the combination of the first word w1 and the second word w2. It is calculated by equation (7).

$$
\begin{aligned}
\mathrm{MLL} ={} & N_{11}\log N_{11} + N_{12}\log N_{12} + N_{13}\log N_{13} + N_{14}\log N_{14} \\
& + N_{21}\log N_{21} + N_{22}\log N_{22} + N_{23}\log N_{23} + N_{24}\log N_{24} - Z\log Z \\
\mathrm{AIC}(\mathrm{DM12}) ={} & -2 \times \mathrm{MLL} + 2 \times 7
\end{aligned}
\tag{7}
$$

For the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the third information amount reference amount AIC(DM12) can be used as a measure of whether only the combination of the first word w1 and the second word w2 should be registered in the dictionary.

For each of the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12) calculated by equations (5), (6), and (7) above, a smaller value indicates a stronger relationship with the target label.
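Equations (5) to (7) can be evaluated directly from the eight document numbers. The following is a minimal sketch, not the patented implementation: the names `xlogx` and `aic_values` are hypothetical, Z is taken to be the sum of the eight document numbers as stated in the claims, and treating 0 × log 0 as 0 is an assumption, since the description does not say how zero counts are handled.

```python
import math


def xlogx(n):
    """n * log10(n), with 0 * log(0) taken as 0 (assumption for empty cells)."""
    return n * math.log10(n) if n > 0 else 0.0


def aic_values(n11, n12, n13, n14, n21, n22, n23, n24):
    """Return AIC(DM1), AIC(DM2), AIC(DM12) per equations (5), (6), (7)."""
    counts = (n11, n12, n13, n14, n21, n22, n23, n24)
    z = sum(counts)  # Z: total number of documents

    # Equation (5): label tied to w1 only.
    mll1 = (xlogx(n11 + n12) + xlogx(n13 + n14)
            + xlogx(n21 + n22) + xlogx(n23 + n24)
            + xlogx(n11 + n13 + n21 + n23)
            + xlogx(n12 + n14 + n22 + n24)
            - 2 * xlogx(z))

    # Equation (6): label tied to w2 only.
    mll2 = (xlogx(n11 + n13) + xlogx(n12 + n14)
            + xlogx(n21 + n23) + xlogx(n22 + n24)
            + xlogx(n11 + n12 + n21 + n22)
            + xlogx(n13 + n14 + n23 + n24)
            - 2 * xlogx(z))

    # Equation (7): label tied to the combination of w1 and w2.
    mll12 = sum(xlogx(n) for n in counts) - xlogx(z)

    return {
        "DM1": -2 * mll1 + 2 * 4,
        "DM2": -2 * mll2 + 2 * 4,
        "DM12": -2 * mll12 + 2 * 7,
    }


uniform = aic_values(1, 1, 1, 1, 1, 1, 1, 1)
```

With eight equal counts the three MLL values coincide, so AIC(DM1) and AIC(DM2) tie and AIC(DM12) is larger by exactly 6, the difference between the penalty terms 2 × 7 and 2 × 4.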

Based on the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12) calculated by the information amount reference amount calculation unit 177, the registered word selection unit 178 determines, for the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, whether only the first word w1, only the second word w2, or only the combination of the first word w1 and the second word w2 should be registered in the dictionary. Specifically, it compares the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12), and as a result of the comparison:
when the first information amount reference amount AIC(DM1) is the minimum, only the first word w1 is set as the registered word;
when the second information amount reference amount AIC(DM2) is the minimum, only the second word w2 is set as the registered word; and
when the third information amount reference amount AIC(DM12) is the minimum, only the combination of the first word w1 and the second word w2 is set as the registered word.
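The comparison performed by the registered word selection unit 178 amounts to taking the minimum of three values. A small sketch follows; the function name `select_registered_word` and the tuple representation of a registered word are hypothetical, and the tie-breaking order (single words before the combination) is an assumption, since the description does not say what happens when two values are equal.

```python
def select_registered_word(w1, w2, aic_dm1, aic_dm2, aic_dm12):
    """Return the registered word as a tuple: (w1,), (w2,), or (w1, w2)."""
    candidates = [
        (aic_dm1, (w1,)),       # register only the first word
        (aic_dm2, (w2,)),       # register only the second word
        (aic_dm12, (w1, w2)),   # register only the combination
    ]
    # min() returns the first minimum, so ties favour earlier entries (assumption).
    return min(candidates, key=lambda c: c[0])[1]


chosen = select_registered_word("goal", "match", 10.0, 12.0, 9.0)
```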

The dictionary registration unit 180 registers, for the target label, the registered word selected by the registered word selection unit 178 in the dictionary database 300. The registered word is thereby stored in the dictionary database 300 in association with the target label.

Next, with reference to FIG. 4, an example of the operation in which the dictionary registration device 100 according to the present embodiment updates the dictionary data stored in the dictionary database 300 based on label determination results will be described.
The label determination device 200 reads, for example, a plurality of text-based web contents acquired via the Internet and stored, as the electronic documents subject to label determination. The label determination device 200 then determines whether at least a certain number of the words corresponding to a label in the dictionary data read from the dictionary database 300 are included in the web content subject to label determination, and thereby determines whether the web content corresponds to that label. The label determination device 200 stores the web content subjected to the label determination processing, together with information indicating the determination result, in the electronic document storage unit 110.
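The threshold test described for the label determination device 200 can be sketched as follows. The names `matches_label` and `min_hits` are hypothetical, and counting distinct dictionary words rather than total occurrences is an assumption; the description only requires that a certain number or more of the label's words be contained in the content.

```python
def matches_label(document_words, label_words, min_hits):
    """Return True if the document contains at least min_hits of the label's words.

    document_words: iterable of words extracted from the web content.
    label_words:    words registered in the dictionary for one label.
    min_hits:       the "certain number" threshold (hypothetical name).
    """
    # Assumption: each dictionary word counts once, however often it occurs.
    hits = len(set(document_words) & set(label_words))
    return hits >= min_hits


doc = ["the", "goal", "decided", "the", "match"]
is_sports = matches_label(doc, {"goal", "match", "league"}, 2)
```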

The normalization processing unit 120 reads the label-determined web content and the label determination results stored in the electronic document storage unit 110 (step S1) and normalizes the web content (step S2). The morphological analysis unit 130 performs morphological analysis on the normalized electronic document, based on the electronic document obtained by normalizing the web content in step S2 and the morphological analysis dictionary read from the morphological analysis dictionary storage unit 140, and generates a document vector table (step S3).

The word distribution calculation unit 150 generates a word distribution table based on the document vector table generated by the morphological analysis unit 130 in step S3 (step S4). If the electronic document storage unit 110 still contains web content and determination results that have not yet been used to update the word distribution table (step S5: YES), the dictionary registration device 100 repeats the processing from step S1 to step S4.

When the word distribution calculation unit 150 has updated the word distribution table based on all of the web content and determination results stored in the electronic document storage unit 110 (step S5: NO), the SSS calculation unit 171 of the index word score calculation unit 170 performs the SSS calculation processing described above. The SDS calculation unit 174 performs SDS calculation processing based on SSS(W) calculated by the SSS calculation unit 171 and calculates the word list SDS(W) containing the SDS of each word (step S7). Meanwhile, the MSS calculation unit 172 performs MSS calculation processing based on SSS(W) calculated by the SSS calculation unit 171 and calculates the word list MSS(W) containing the MSS of each word (step S8). The MDS calculation unit 173 then performs MDS calculation processing based on MSS(W) calculated by the MSS calculation unit 172 and calculates the word list MDS(W) containing the MDS of each word (step S9).

Next, in step S10, the document number calculation unit 175 first calculates the first to eighth document numbers N11, N12, N13, N14, N21, N22, N23, and N24 for each combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label. The information amount reference amount calculation unit 177 then calculates the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12) using the first to eighth document numbers N11, N12, N13, N14, N21, N22, N23, and N24 calculated by the document number calculation unit 175.
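The document counting in step S10 can be sketched as a tally over the correct and incorrect document sets, following the definitions of the first to eighth document numbers given in the claims (both words, first word only, second word only, neither). The function name `count_documents` and the representation of each document as a set of words are hypothetical.

```python
def count_documents(correct_docs, incorrect_docs, w1, w2):
    """Return (N11, N12, N13, N14, N21, N22, N23, N24) for the pair (w1, w2).

    Each document is assumed to be given as a set of its words.
    """
    def tally(docs):
        both = only_w1 = only_w2 = neither = 0
        for words in docs:
            has1, has2 = w1 in words, w2 in words
            if has1 and has2:
                both += 1
            elif has1:
                only_w1 += 1
            elif has2:
                only_w2 += 1
            else:
                neither += 1
        return both, only_w1, only_w2, neither

    # Correct documents give N11..N14; incorrect documents give N21..N24.
    return tally(correct_docs) + tally(incorrect_docs)


correct = [{"goal", "match"}, {"goal"}, {"match"}, {"rain"},
           {"goal", "match", "league"}]
incorrect = [{"goal"}, {"rain"}, {"rain", "wind"}]
counts = count_documents(correct, incorrect, "goal", "match")
```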

Next, in step S11, based on the first to third information amount reference amounts AIC(DM1), AIC(DM2), and AIC(DM12) calculated by the information amount reference amount calculation unit 177, the registered word selection unit 178 selects as the registered word, for the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, whichever of the first word w1, the second word w2, and the combination of the first word w1 and the second word w2 is most strongly related to the target label.

Next, in step S12, the dictionary registration unit 180 stores, for the target label, the registered word selected by the registered word selection unit 178 in the dictionary database 300 as dictionary data.

As described above, according to the present embodiment, for the combination of the first word w1 and the second word w2 that are dictionary registration candidates for determining documents corresponding to the target label, the first word w1, the second word w2, and the combination of the first word w1 and the second word w2 are evaluated on the same scale, and whichever is most strongly related to the target label can be registered in the dictionary database 300. This makes it possible to generate a dictionary database in which words and word combinations are mixed. As a result, the following effects are obtained.

When the topic of an input document is determined using a dictionary database composed only of scored words (uni-grams), few documents belonging to a specific topic are missed, but over-detection increases. Conversely, when a dictionary database composed only of scored word combinations (bi-grams) is used, over-detection decreases but misses increase. According to the present embodiment, however, scored words and scored word combinations can be mixed in the same dictionary database, so both over-detection and misses can be reduced when determining the topic of an input document.
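To illustrate how a mixed dictionary might be consulted, the sketch below stores each entry as a tuple of one word (uni-gram) or two words (bi-gram) and counts an entry as matched only when all of its words occur in the document. The name `count_hits` and this matching rule are assumptions made for illustration; scores are omitted, and the description does not specify how the label determination device treats combination entries.

```python
def count_hits(document_words, entries):
    """Count dictionary entries whose words all occur in the document.

    entries: iterable of tuples, e.g. ("goal",) or ("penalty", "kick").
    """
    words = set(document_words)
    # A bi-gram entry needs both words; a uni-gram entry needs its single word.
    return sum(1 for entry in entries if set(entry) <= words)


entries = [("goal",), ("penalty", "kick")]
hits = count_hits({"goal", "penalty", "kick"}, entries)
```

Under this rule a document containing only "kick" matches nothing, which reflects the stricter behaviour of combination entries described above.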

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like within a scope that does not depart from the gist of the present invention.
For example, in the embodiment described above, a combination of the first word w1 and the second word w2 is used as the dictionary registration candidate, but the present invention is equally applicable to combinations of three or more words as dictionary registration candidates.

A program for realizing each step shown in FIG. 4 may be recorded on a computer-readable recording medium, and the dictionary registration processing may be performed by causing a computer system to read and execute the program recorded on the recording medium. The "computer system" here may include an OS and hardware such as peripheral devices.
If a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).
The "computer-readable recording medium" refers to a writable nonvolatile memory such as a flexible disk, a magneto-optical disk, a ROM, or a flash memory; a portable medium such as a DVD (Digital Versatile Disk); or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
The program may also be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication wire) like a telephone line.
The program may also realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

Description of symbols
100: dictionary registration device; 200: label determination device; 300: dictionary database; 110: electronic document storage unit; 120: normalization processing unit; 130: morphological analysis unit; 140: morphological analysis dictionary storage unit; 150: word distribution calculation unit; 160: word distribution table storage unit; 170: index word score calculation unit; 171: SSS calculation unit; 172: MSS calculation unit; 173: MDS calculation unit; 174: SDS calculation unit; 175: document number calculation unit; 176: input unit; 177: information amount reference amount calculation unit; 178: registered word selection unit; 180: dictionary registration unit

Claims

A dictionary registration device comprising:
an electronic document storage unit that stores a set of correct documents related to a specific property and a set of incorrect documents not related to the property;
an input unit to which a first word and a second word, which are dictionary registration candidates for determining a document related to the property, are input;
a document number calculation unit that calculates, for each combination of the first word and the second word, a first document number that is the number of the correct documents that include both the first word and the second word, a second document number that is the number of the correct documents that include the first word and do not include the second word, a third document number that is the number of the correct documents that do not include the first word and include the second word, a fourth document number that is the number of the correct documents that include neither the first word nor the second word, a fifth document number that is the number of the incorrect documents that include both the first word and the second word, a sixth document number that is the number of the incorrect documents that include the first word and do not include the second word, a seventh document number that is the number of the incorrect documents that do not include the first word and include the second word, and an eighth document number that is the number of the incorrect documents that include neither the first word nor the second word;
an information amount reference amount calculation unit that calculates, based on the first to eighth document numbers calculated by the document number calculation unit, a first information amount reference amount indicating the degree of the relationship between the first word and the property, a second information amount reference amount indicating the degree of the relationship between the second word and the property, and a third information amount reference amount indicating the degree of the relationship between the combination of the first word and the second word and the property; and
a registered word selection unit that compares the information amount reference amounts calculated by the information amount reference amount calculation unit and selects, as a registered word, whichever of the first word, the second word, and the combination of the first word and the second word has the maximum degree of relationship to the property.

The dictionary registration device according to claim 1, wherein the first information amount reference amount AIC(DM1) indicating the degree of the relationship between the first word and the property is calculated by the following formulas:
$$
\begin{aligned}
\mathrm{MLL} ={} & (N_{11}+N_{12})\log(N_{11}+N_{12}) + (N_{13}+N_{14})\log(N_{13}+N_{14}) \\
& + (N_{21}+N_{22})\log(N_{21}+N_{22}) + (N_{23}+N_{24})\log(N_{23}+N_{24}) \\
& + (N_{11}+N_{13}+N_{21}+N_{23})\log(N_{11}+N_{13}+N_{21}+N_{23}) \\
& + (N_{12}+N_{14}+N_{22}+N_{24})\log(N_{12}+N_{14}+N_{22}+N_{24}) - 2 \times Z\log Z \\
\mathrm{AIC}(\mathrm{DM1}) ={} & -2 \times \mathrm{MLL} + 2 \times 4
\end{aligned}
$$
where the base 10 of the logarithm is omitted, Z = N11 + N12 + N13 + N14 + N21 + N22 + N23 + N24, N11 is the first document number, N12 is the second document number, N13 is the third document number, N14 is the fourth document number, N21 is the fifth document number, N22 is the sixth document number, N23 is the seventh document number, and N24 is the eighth document number.

The dictionary registration device according to claim 1, wherein the second information amount reference amount AIC(DM2) indicating the degree of the relationship between the second word and the property is calculated by the following formulas:
$$
\begin{aligned}
\mathrm{MLL} ={} & (N_{11}+N_{13})\log(N_{11}+N_{13}) + (N_{12}+N_{14})\log(N_{12}+N_{14}) \\
& + (N_{21}+N_{23})\log(N_{21}+N_{23}) + (N_{22}+N_{24})\log(N_{22}+N_{24}) \\
& + (N_{11}+N_{12}+N_{21}+N_{22})\log(N_{11}+N_{12}+N_{21}+N_{22}) \\
& + (N_{13}+N_{14}+N_{23}+N_{24})\log(N_{13}+N_{14}+N_{23}+N_{24}) - 2 \times Z\log Z \\
\mathrm{AIC}(\mathrm{DM2}) ={} & -2 \times \mathrm{MLL} + 2 \times 4
\end{aligned}
$$
where the base 10 of the logarithm is omitted, Z = N11 + N12 + N13 + N14 + N21 + N22 + N23 + N24, N11 is the first document number, N12 is the second document number, N13 is the third document number, N14 is the fourth document number, N21 is the fifth document number, N22 is the sixth document number, N23 is the seventh document number, and N24 is the eighth document number.

The dictionary registration device according to claim 1, wherein the third information amount reference amount AIC(DM12) indicating the degree of the relationship between the combination of the first word and the second word and the property is calculated by the following formulas:
$$
\begin{aligned}
\mathrm{MLL} ={} & N_{11}\log N_{11} + N_{12}\log N_{12} + N_{13}\log N_{13} + N_{14}\log N_{14} \\
& + N_{21}\log N_{21} + N_{22}\log N_{22} + N_{23}\log N_{23} + N_{24}\log N_{24} - Z\log Z \\
\mathrm{AIC}(\mathrm{DM12}) ={} & -2 \times \mathrm{MLL} + 2 \times 7
\end{aligned}
$$
where the base 10 of the logarithm is omitted, Z = N11 + N12 + N13 + N14 + N21 + N22 + N23 + N24, N11 is the first document number, N12 is the second document number, N13 is the third document number, N14 is the fourth document number, N21 is the fifth document number, N22 is the sixth document number, N23 is the seventh document number, and N24 is the eighth document number.

The dictionary registration apparatus according to claim 1, further comprising a normalization processing unit that performs normalization processing of words.

The dictionary registration apparatus according to claim 1, further comprising a morphological analysis unit that extracts a word from a document.

A document label determination system comprising:
the dictionary registration device according to any one of claims 1 to 6;
a dictionary database in which a registered word selected by the dictionary registration device is stored in association with a label representing a specific property; and
a label determination device that determines a label corresponding to an input document using the dictionary database.

A dictionary registration program for causing a computer, which has an electronic document storage unit that stores a set of correct documents related to a specific property and a set of incorrect documents not related to the property, and an input unit to which a first word and a second word, which are dictionary registration candidates for determining a document related to the property, are input, to execute:
a step of calculating, for each combination of the first word and the second word, a first document number that is the number of the correct documents that include both the first word and the second word, a second document number that is the number of the correct documents that include the first word and do not include the second word, a third document number that is the number of the correct documents that do not include the first word and include the second word, a fourth document number that is the number of the correct documents that include neither the first word nor the second word, a fifth document number that is the number of the incorrect documents that include both the first word and the second word, a sixth document number that is the number of the incorrect documents that include the first word and do not include the second word, a seventh document number that is the number of the incorrect documents that do not include the first word and include the second word, and an eighth document number that is the number of the incorrect documents that include neither the first word nor the second word;
a step of calculating, based on the calculated first to eighth document numbers, a first information amount reference amount indicating the degree of the relationship between the first word and the property, a second information amount reference amount indicating the degree of the relationship between the second word and the property, and a third information amount reference amount indicating the degree of the relationship between the combination of the first word and the second word and the property; and
a step of comparing the calculated information amount reference amounts and selecting, as a registered word, whichever of the first word, the second word, and the combination of the first word and the second word has the maximum degree of relationship to the property.