JP2013145461A

JP2013145461A - Dictionary generating device, document label determination system, and computer program

Info

Publication number: JP2013145461A
Application number: JP2012005454A
Authority: JP
Inventors: Hajime Hattori; 元服部; Tadashi Yanagihara; 正柳原; Toshihiro Ono; 智弘小野
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2012-01-13
Filing date: 2012-01-13
Publication date: 2013-07-25
Anticipated expiration: 2032-01-13
Also published as: JP5739352B2

Abstract

[Problem] To remove duplicate entries from a dictionary in which scored single words and scored word combinations are mixed.
[Solution] The device comprises: a score calculation unit 12 that calculates information criterion values for consecutive words; a dictionary registration candidate selection unit 16 that, based on the information criterion values, determines the words or combinations of consecutive words to be registered in a filtering dictionary; a dictionary registration unit 18 that registers each selected dictionary registration candidate in the dictionary together with a score based on the corresponding information criterion value; and an input document filter unit 17 that deletes the correct documents and incorrect documents containing the candidates selected by the dictionary registration candidate selection unit 16 from the set of correct documents and the set of incorrect documents, thereby constructing a new set of correct documents and a new set of incorrect documents. The score calculation unit 12 calculates the information criterion values for consecutive words based on the new sets constructed by the input document filter unit 17.
[Selected drawing] FIG. 1

Description

The present invention relates to a dictionary generation device, a document label determination system, and a computer program.

Conventionally, document label determination systems have been used to classify electronic documents, such as text-based web content (for example, blogs) and document files generated by word-processing software. Such a system determines the nature of the text information contained in an electronic document and assigns a label according to that content. Examples of labels include those indicating the topic of an electronic document, such as sports or economy. To determine whether a target electronic document corresponds to a specific label, the system uses dictionary data in which a plurality of index terms highly relevant to that label are associated with it. For example, when the label is "economy", dictionary data associating index terms such as "Ministry of Finance" and "exchange rate" with the label is prepared in advance. The document label determination system detects words in the target electronic document that match the index terms in the dictionary data and, according to the degree of matching, determines whether the electronic document corresponds to the specific label.

In the conventional dictionary generation technique described in Patent Document 1, a model test on scored words is performed to create a dictionary consisting only of scored words, and a model test on combinations of scored words is performed to create a dictionary consisting only of combinations of scored words. Non-Patent Document 1 proposes a technique that performs a model test based on an information criterion and selects only the words that are important for determining whether a document corresponds to a topic.

Patent Document 1: JP 2010-015395 A

Non-Patent Document 1: Kazunori Matsumoto, Kazuo Hashimoto, "Schema Design for Causal Law Mining from Incomplete Database", Discovery Science, Second International Conference, DS '99, Tokyo, Japan, December 1999, Proceedings. Lecture Notes in Computer Science 1721, Springer, pp. 92-102, 1999.

However, in the conventional dictionary generation technique described above, the model test on scored words and the model test on combinations of scored words are performed independently, so the two sets of test results cannot be treated on an equal footing. Consequently, a dictionary consisting only of scored words and a dictionary consisting only of scored word combinations cannot be mixed.

Furthermore, when the set of documents corresponding to a topic x contains a disproportionately large number of documents with particular content, many of the words selected for registration in the filtering dictionary are words that appear in those over-represented documents. As a result, words or word combinations that should be unnecessary for the determination may be registered in the dictionary. In that case, the dictionary ends up detecting documents with that particular content rather than determining whether a document corresponds to topic x, and the document label determination system erroneously determines that documents not actually corresponding to the specific topic do correspond to it.
Patent Document 1 addresses this problem when forming word combinations, but its method could not be applied when words or word combinations are registered in a dictionary.

The present invention has been made in view of these circumstances. Its object is to provide a dictionary generation device, a document label determination system, and a computer program that allow a dictionary consisting only of scored words and a dictionary consisting only of scored word combinations to be mixed, and that prevent the same word or the same word combination from being registered in the dictionary more than once.

To solve the above problem, a dictionary generation device according to the present invention generates a dictionary for determining whether a document relates to a specific property, using a set of correct documents related to that property and a set of incorrect documents not related to it. The device comprises: a score calculation unit that takes each of one or more words contained in the correct documents or the incorrect documents as a dictionary registration candidate and calculates information criterion values for consecutive words; a dictionary registration candidate selection unit that, based on the information criterion values, determines the words or combinations of consecutive words to be registered in a filtering dictionary; a dictionary registration unit that registers each selected dictionary registration candidate in the dictionary together with a score based on the corresponding information criterion value; and an input document filter unit that deletes the correct documents and incorrect documents containing the candidates selected by the dictionary registration candidate selection unit from the set of correct documents and the set of incorrect documents, thereby constructing a new set of correct documents and a new set of incorrect documents. The score calculation unit takes each of one or more words contained in the new sets constructed by the input document filter unit as a dictionary registration candidate and calculates the information criterion values for the consecutive words.
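The register-then-filter cycle described above can be sketched as follows. This is only an illustration under stated assumptions: documents are modeled as sets of words, `toy_score` is a hypothetical stand-in for the information-criterion-based score (smaller means more topic-specific), and `build_dictionary` and `max_entries` are names invented for this sketch.

```python
def toy_score(word, correct, incorrect):
    """Hypothetical stand-in for the AIC-based score (smaller is better)."""
    in_correct = sum(word in doc for doc in correct)
    in_incorrect = sum(word in doc for doc in incorrect)
    return in_incorrect - in_correct


def build_dictionary(correct, incorrect, max_entries=10):
    """Register one candidate per round, then drop the documents containing it."""
    dictionary = {}
    correct, incorrect = list(correct), list(incorrect)
    while correct and len(dictionary) < max_entries:
        # Candidates come only from the documents that are still left.
        vocabulary = sorted(set().union(*correct))
        best = min(vocabulary, key=lambda w: toy_score(w, correct, incorrect))
        score = toy_score(best, correct, incorrect)
        if score >= 0:
            break  # nothing left that favors the correct documents
        dictionary[best] = score
        # Input document filter: remove documents containing the new entry.
        correct = [doc for doc in correct if best not in doc]
        incorrect = [doc for doc in incorrect if best not in doc]
    return dictionary


correct_docs = [{"yen", "bank"}, {"yen", "rate"}, {"yen", "tax"}, {"budget", "tax"}]
incorrect_docs = [{"goal", "match"}, {"tax", "match"}]
result = build_dictionary(correct_docs, incorrect_docs)
```

In this toy data, three of the four correct documents mention "yen". Once those are filtered out, the remaining document gets a chance to contribute "budget", which is the bias-avoidance effect the filter unit is meant to provide.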

In the dictionary generation device described above, in one aspect of the present invention, the input document filter unit removes, from the sets of correct documents and incorrect documents, the documents that a topic determination device can already determine to be correct documents using the words already registered in the filtering dictionary, so that the score calculation unit and the dictionary registration candidate selection unit do not select words biased toward a particular group of documents.

In the dictionary generation device described above, in one aspect of the present invention, the dictionary registration candidate selection unit selects a combination of consecutive words for registration in the filtering dictionary only when that combination is contained more often in the correct documents than in the incorrect documents.

In the dictionary generation device described above, in one aspect of the present invention, when the same dictionary registration candidate has been selected more than once, the dictionary registration unit uses the smallest of the scores associated with those selections as the score of that candidate.
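A minimal sketch of this duplicate handling, assuming the dictionary is an in-memory mapping from a tuple of words to a score; the function name `register` and the key layout are illustrative choices, not taken from the specification.

```python
def register(dictionary, candidate, score):
    """Register a word or word combination, keeping the smallest score seen."""
    # A single word is stored under a 1-tuple so that words and
    # combinations share one key space.
    key = (candidate,) if isinstance(candidate, str) else tuple(candidate)
    dictionary[key] = min(score, dictionary.get(key, score))
    return dictionary


entries = {}
register(entries, "yen", -3.0)
register(entries, "yen", -5.0)           # same candidate selected again
register(entries, ("yen", "rate"), -1.0)
register(entries, "yen", -4.0)           # larger score, so it is ignored
```

Because each key appears once, the same word or combination cannot be registered twice, and the retained score is the minimum over all selections.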

In the dictionary generation device described above, in one aspect of the present invention, the score calculation unit receives the new set of correct documents and the new set of incorrect documents from the input document filter unit, and the selected dictionary registration candidate is registered in the dictionary together with a score based on the corresponding information criterion value.

In the dictionary generation device described above, in one aspect of the present invention, the score calculation unit extracts two or more consecutive words contained in the correct documents and incorrect documents; counts, both for the case where each word occurs alone and for the case where the words occur consecutively, the number of occurrences in the correct documents and incorrect documents; calculates from those counts an information criterion value for the combination of two or more consecutive words; and selects dictionary registration candidates based on the information criterion value.

One aspect of the present invention is a dictionary generation device capable of handling a variety of input documents, comprising the dictionary generation device described above and a document normalization unit that deletes data other than text data from an input document.

One aspect of the present invention is a document label determination system comprising: the dictionary generation device described above; a filtering dictionary, generated by the dictionary generation device, that stores scored words and scored word combinations in association with a label representing a specific property; and a topic determination device that determines the label corresponding to an input document using the filtering dictionary.

One aspect of the present invention is a computer program for performing processing that generates a dictionary for determining whether a document relates to a specific property, using a set of correct documents related to that property and a set of incorrect documents not related to it. The program causes a computer to execute the steps of: taking each of one or more words contained in the correct documents or the incorrect documents as a dictionary registration candidate and calculating information criterion values for consecutive words; determining, based on the information criterion values, the words or combinations of consecutive words to be registered in a filtering dictionary; registering each selected dictionary registration candidate in the dictionary together with a score based on the corresponding information criterion value; deleting the correct documents and incorrect documents containing the selected dictionary registration candidates from the set of correct documents and the set of incorrect documents, thereby constructing a new set of correct documents and a new set of incorrect documents; and taking each of one or more words contained in the newly constructed sets as a dictionary registration candidate and calculating the information criterion values for the consecutive words.

One aspect of the present invention is a computer program for performing processing that generates a dictionary for determining whether a document relates to a specific property, using a set of correct documents related to that property and a set of incorrect documents not related to it. The program causes a computer to: extract two or more consecutive words contained in the correct documents and incorrect documents; count, both for the case where each word occurs alone and for the case where the words occur consecutively, the number of occurrences in the correct documents and incorrect documents; calculate from those counts an information criterion value for the combination of two or more consecutive words; and select dictionary registration candidates based on the information criterion value.
This allows the dictionary generation device described above to be realized using a computer.

According to the present invention, when the set of documents corresponding to topic x contains a disproportionately large number of documents with particular content, words appearing in those over-represented documents can be prevented from dominating the words selected for registration in the filtering dictionary. In addition, words can also be selected from kinds of documents that are relatively few in number.

FIG. 1 is a block diagram showing the configuration of a document label determination system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the flow of dictionary generation processing according to the embodiment.
FIG. 3 is a configuration example of a 2×2 contingency table.
FIG. 4 is a configuration example of a 2×4 contingency table.
FIG. 5 is an example of a program for dictionary registration candidate selection processing according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a document label determination system according to an embodiment of the present invention. In FIG. 1, the topic determination device 32 uses the filtering dictionary 30 to determine the label corresponding to an input document consisting of data (text data) 100. A label indicates a property of a document, such as its topic. In this embodiment, a label is defined as indicating the topic of a document.

The topic determination device 32 outputs, as label determination results, text corresponding to a specific topic x (correct documents) 110 and text not corresponding to the specific topic x (incorrect documents) 120. This output data (the set of correct documents related to the specific topic x and the set of incorrect documents not related to it) is input to the dictionary generation device 2.

Note that the data input to the dictionary generation device 2 (the set of correct documents related to the specific topic x and the set of incorrect documents not related to it) may be produced by a topic determination means other than the topic determination device 32 (for example, a human).

The dictionary generation device 2 uses the set of correct documents related to the specific topic x and the set of incorrect documents not related to it to generate the filtering dictionary 30, which is used to determine whether a document relates to the specific topic x. The filtering dictionary 30 is a dictionary database that stores scored words and scored word combinations in association with a label representing the specific topic x.

The dictionary generation device 2 includes a document normalization unit 4, a morphological analysis unit 6, a score calculation unit 12, a dictionary registration candidate selection unit 16, an input document filter unit 17, and a dictionary registration unit 18.

The operation of the dictionary generation device 2 shown in FIG. 1 is described below with reference to FIG. 2. FIG. 2 is a flowchart showing the flow of dictionary generation processing according to this embodiment.

[Step S1: Document normalization]
The document normalization unit 4 normalizes the correct documents and incorrect documents. Normalization unifies orthographic variants and removes tags according to predetermined rules.

A specific example of document normalization follows.
The input documents are assumed to be blog articles (body text consisting of text data, including pictographs and HTML (HyperText Markup Language) tags; no images) and blog comments (body text consisting of text data, including HTML tags; no images). The document normalization unit 4 normalizes the body text of each input document according to predetermined normalization rules and outputs the normalized document. Examples of normalization rules are shown below.
(Examples of normalization rules)
・Unify the hyphen "‐", the minus sign "−", and the long vowel mark "ー" into a predetermined symbol (for example, "−").
・Convert half-width characters to full-width characters.
・Replace tab characters with spaces.
・Replace pictographs with a specific character code (0xA2A2).
・Delete HTML tags.
・Convert small Japanese kana to full-size kana, for example "ィ" to "イ". However, if a word is registered with the small kana in the morphological analysis dictionary (not shown), leave the small kana unconverted.
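Several of the rules above can be sketched with Python's standard library. This is an illustrative approximation, not the unit's actual implementation: the function name `normalize` is hypothetical, the tag pattern is a simple regular expression rather than a real HTML parser, and the pictograph rule and the morphological-dictionary exception for small kana are left out because they depend on data not given here.

```python
import re

# Hyphen-like characters to unify: ASCII hyphen, U+2010 hyphen,
# U+2212 minus sign, and U+30FC long vowel mark, all mapped to U+2212.
DASH_TABLE = {ord(ch): "\u2212" for ch in "-\u2010\u2212\u30fc"}

# Small katakana mapped to their full-size forms.
KANA_TABLE = str.maketrans("ァィゥェォッャュョ", "アイウエオツヤユヨ")

# Printable ASCII (0x21-0x7E) mapped to the full-width forms at U+FF01-U+FF5E.
FULLWIDTH_TABLE = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}


def normalize(text):
    """Apply a subset of the normalization rules in a fixed order."""
    text = re.sub(r"<[^>]+>", "", text)    # delete HTML tags first
    text = text.replace("\t", " ")         # tab -> space
    text = text.translate(DASH_TABLE)      # unify hyphen-like marks
    text = text.translate(KANA_TABLE)      # small kana -> full-size kana
    return text.translate(FULLWIDTH_TABLE) # half-width -> full-width


sample = normalize("<p>AB-1\tィ</p>")
```

Tags are removed before the full-width conversion so that "<" and ">" are still ASCII when the pattern runs. Note that unifying the long vowel mark, as the first rule states, also rewrites it inside katakana words.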

[Step S2: Morphological processing of documents]
The morphological analysis unit 6 performs morphological processing on the normalized correct documents and incorrect documents output from the document normalization unit 4. In this processing, a morphological analysis dictionary (not shown) is used to split the text into words and assign a part of speech to each word. Words assigned a predetermined part of speech are then extracted.

A specific example of morphological processing follows.
The input documents are assumed to be normalized blog articles (body text consisting of normalized text data) and normalized blog comments (body text consisting of normalized text data). Using a morphological analysis dictionary (not shown), the morphological analysis unit 6 splits the text of the normalized blog articles and blog comments into words and assigns a part of speech to each word. It then extracts the words assigned a predetermined part of speech (for example, nouns). Next, it normalizes the extracted words, for example by normalizing English words (converting lowercase letters to uppercase) and normalizing katakana words (converting "コンピューター" to "コンピュータ"). The morphological analysis unit 6 stores the normalized words in a frequency calculation target word table, taking care that the same word is not stored in the table more than once.
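The post-processing after tokenization can be sketched as below. The tokenizer itself is not shown: the sketch assumes some morphological analyzer has already produced `(surface, part_of_speech)` pairs, and the part-of-speech tag "noun", the function name `build_word_table`, and the specific katakana rule (drop a trailing long vowel mark from all-katakana words longer than three characters) are assumptions made for illustration only.

```python
def is_katakana(word):
    """True if every character lies in the katakana block (U+30A0-U+30FF)."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in word)


def build_word_table(tokens, wanted_pos=("noun",)):
    """Filter by part of speech, normalize, and store each word only once."""
    table, seen = [], set()
    for surface, pos in tokens:
        if pos not in wanted_pos:
            continue
        word = surface.upper()  # English words: lowercase -> uppercase
        # Assumed katakana rule: "コンピューター" -> "コンピュータ".
        if is_katakana(word) and len(word) > 3 and word.endswith("\u30fc"):
            word = word[:-1]
        if word not in seen:    # avoid duplicate entries in the table
            seen.add(word)
            table.append(word)
    return table


tokens = [
    ("computer", "noun"),
    ("コンピューター", "noun"),
    ("走る", "verb"),
    ("Computer", "noun"),
    ("コンピュータ", "noun"),
]
word_table = build_word_table(tokens)
```

Deduplication is applied after normalization, so spelling variants that normalize to the same form occupy a single slot in the frequency calculation target word table.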

[Step S3: Calculation of SSS (single static score) and MSS (multi static score)]
For each word stored in the frequency calculation target word table output from the morphological analysis unit 6, the score calculation unit 12 creates the 2×2 contingency table for SSS shown in FIG. 3. In FIG. 3, a, b, c, and d for a word w are the following values.
a: the number of documents in the set of correct documents DOC_M that contain the word w
b: the number of documents in the set of correct documents DOC_M that do not contain the word w
c: the number of documents in the set of incorrect documents DOC_N that contain the word w
d: the number of documents in the set of incorrect documents DOC_N that do not contain the word w
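The four counts can be computed directly from their definitions. The following is a small sketch assuming each document has already been reduced to a set of words; `table_2x2` is a name chosen for this example.

```python
def table_2x2(word, doc_m, doc_n):
    """Return (a, b, c, d) for `word` over correct (doc_m) and incorrect (doc_n) sets."""
    a = sum(word in doc for doc in doc_m)  # correct documents containing the word
    b = len(doc_m) - a                     # correct documents without the word
    c = sum(word in doc for doc in doc_n)  # incorrect documents containing the word
    d = len(doc_n) - c                     # incorrect documents without the word
    return a, b, c, d


doc_m = [{"yen", "rate"}, {"yen"}, {"rate"}, {"tax"}, {"yen", "rate", "tax"}]
doc_n = [{"yen"}, {"goal"}, {"rate", "goal"}]
counts = table_2x2("yen", doc_m, doc_n)
```

The counts are document frequencies: a word that appears several times in one document still contributes 1 to a or c.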

For the words stored in the frequency calculation target word table output from the morphological analysis unit 6, the score calculation unit 12 also creates the 2×4 contingency table for MSS shown in FIG. 4. In FIG. 4, N11, N12, N13, N14, N21, N22, N23, and N24 for the combination of a word w_1 and a word w_2 are the following values.
N11: the number of documents in the set of correct documents DOC_M that contain w_1 and contain w_2
N12: the number of documents in the set of correct documents DOC_M that contain w_1 and do not contain w_2
N13: the number of documents in the set of correct documents DOC_M that do not contain w_1 and contain w_2
N14: the number of documents in the set of correct documents DOC_M that do not contain w_1 and do not contain w_2
N21: the number of documents in the set of incorrect documents DOC_N that contain w_1 and contain w_2
N22: the number of documents in the set of incorrect documents DOC_N that contain w_1 and do not contain w_2
N23: the number of documents in the set of incorrect documents DOC_N that do not contain w_1 and contain w_2
N24: the number of documents in the set of incorrect documents DOC_N that do not contain w_1 and do not contain w_2
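The eight cells follow from the same kind of counting. This sketch again assumes documents are sets of words; `table_2x4` is an illustrative name, and each row is returned in the order N*1, N*2, N*3, N*4.

```python
def table_2x4(w1, w2, doc_m, doc_n):
    """Return ([N11, N12, N13, N14], [N21, N22, N23, N24])."""
    def row(docs):
        cells = [0, 0, 0, 0]
        for doc in docs:
            # Column index: 0 = both words, 1 = only w1, 2 = only w2, 3 = neither.
            index = (0 if w1 in doc else 2) + (0 if w2 in doc else 1)
            cells[index] += 1
        return cells
    return row(doc_m), row(doc_n)


doc_m = [{"yen", "rate"}, {"yen"}, {"rate"}, {"tax"}, {"yen", "rate", "tax"}]
doc_n = [{"yen"}, {"goal"}, {"rate", "goal"}]
row_m, row_n = table_2x4("yen", "rate", doc_m, doc_n)
```

Every document falls into exactly one cell, so the eight cells sum to the total number of documents.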

For N12, N13, N22, and N23, the following relations hold with a and c in the 2×2 contingency table shown in FIG. 3 and with N11 and N21. Here, a and c in the 2×2 contingency table for the word w_1 are written a(w_1) and c(w_1), and a and c in the 2×2 contingency table for the word w_2 are written a(w_2) and c(w_2). Because c counts incorrect documents, N22 and N23 are obtained by subtracting N21, the number of incorrect documents containing both words.
N12 = a(w_1) - N11
N13 = a(w_2) - N11
N22 = c(w_1) - N21
N23 = c(w_2) - N21

The total number of documents Z is given by the following relation.
Z = N11 + N12 + N13 + N14 + N21 + N22 + N23 + N24
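As a worked check of these relations, the remaining cells of the 2×4 table can be derived from the single-word counts plus the two co-occurrence counts. In this sketch the incorrect-document cells are taken relative to N21, since c(w) counts incorrect documents, and N14 and N24 follow by inclusion-exclusion; the function name and parameter names are illustrative.

```python
def cells_from_margins(a1, a2, c1, c2, n11, n21, size_m, size_n):
    """Derive the 2x4 table from a(w_1), a(w_2), c(w_1), c(w_2), N11, and N21.

    size_m and size_n are the sizes of DOC_M and DOC_N.
    """
    n12 = a1 - n11                   # correct: w_1 only
    n13 = a2 - n11                   # correct: w_2 only
    n14 = size_m - a1 - a2 + n11     # correct: neither word
    n22 = c1 - n21                   # incorrect: w_1 only
    n23 = c2 - n21                   # incorrect: w_2 only
    n24 = size_n - c1 - c2 + n21     # incorrect: neither word
    row_m = [n11, n12, n13, n14]
    row_n = [n21, n22, n23, n24]
    z = sum(row_m) + sum(row_n)      # total number of documents Z
    return row_m, row_n, z


# Five correct and three incorrect documents; w_1 and w_2 each appear in
# three correct documents (two of them shared) and one incorrect document each.
row_m, row_n, z = cells_from_margins(3, 3, 1, 1, 2, 0, 5, 3)
```

Only N11 and N21 need a pass over the documents for each word pair; the other six cells reuse the per-word counts already computed for the 2×2 tables.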

[Step S4: Calculation of AIC (information criterion values)]
Using the first to eighth document counts calculated above (N11, N12, N13, N14, N21, N22, N23, N24), the score calculation unit 12 calculates the first to fourth information criterion values (AIC(M1), AIC(M2), AIC(M3), AIC(M0)).

AIC(M1) indicates the degree to which the word w_1 is related to the specific topic x. AIC(M1) is calculated by equation (1). The smaller the value of AIC(M1), the more strongly w_1 is related to the specific topic x. In the following, the base of log, which is 10, is omitted from the notation.

AIC(M2) indicates the degree to which the word w_2 is related to the specific topic x. AIC(M2) is calculated by equation (2). The smaller the value of AIC(M2), the more strongly w_2 is related to the specific topic x.

AIC(M3) indicates the degree to which the combination of the word w_1 and the word w_2 is related to the specific topic x. AIC(M3) is calculated by equation (3). The smaller the value of AIC(M3), the more strongly the combination is related to the specific topic x.

AIC(M0) indicates the degree to which the word w_1, the word w_2, and the combination of w_1 and w_2 are all unrelated to the specific topic x. AIC(M0) is calculated by equation (4). The smaller the value of AIC(M0), the more strongly all of them are unrelated to the specific topic x.
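Equations (1) to (4) are not reproduced in this text, and the following does not attempt to restore them. As a hedged illustration of the general idea only, the sketch below applies the textbook AIC comparison between a dependent model and an independent model to the simpler 2×2 table of FIG. 3. It assumes AIC = -2 × (maximum log-likelihood) + 2 × (number of free parameters) and uses the natural logarithm, whereas the description above states base 10; the function names are invented for this example.

```python
import math


def xlogx(n):
    """n * ln(n), with the usual convention 0 * ln(0) = 0."""
    return n * math.log(n) if n > 0 else 0.0


def aic_2x2(a, b, c, d):
    """Return (AIC of dependent model, AIC of independent model) for a 2x2 table."""
    n = a + b + c + d
    # Dependent model: each cell has its own probability (3 free parameters).
    mll_dep = xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) - xlogx(n)
    # Independent model: row and column probabilities only (2 free parameters).
    mll_ind = (xlogx(a + b) + xlogx(c + d)
               + xlogx(a + c) + xlogx(b + d) - 2 * xlogx(n))
    return -2 * mll_dep + 2 * 3, -2 * mll_ind + 2 * 2


# A word concentrated in the correct documents versus an evenly spread word.
skewed_dep, skewed_ind = aic_2x2(40, 10, 5, 45)
flat_dep, flat_ind = aic_2x2(25, 25, 25, 25)
```

For the skewed table the dependent model has the smaller AIC, which in this simplified setting corresponds to "the word is related to the topic"; for the flat table the independent model wins because the extra parameter buys no likelihood.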

In other words, the score calculation unit 12 extracts two or more consecutive words contained in the correct documents and incorrect documents; counts, both for the case where each word occurs alone and for the case where the words occur consecutively, the number of occurrences in the correct documents and incorrect documents; calculates from those counts an information criterion value for the combination of two or more consecutive words; and selects dictionary registration candidates based on the calculated information criterion value.
[Step S5: Selection of dictionary registration candidates]
The dictionary registration candidate selection unit 16 selects dictionary registration candidates using the first to fourth information criterion values (AIC(M1), AIC(M2), AIC(M3), AIC(M0)) calculated by the score calculation unit 12. The dictionary registration candidates are the word w_1, the word w_2, and the combination of w_1 and w_2. The dictionary registration candidate selection processing according to this embodiment is described below.

In the present embodiment, dictionary registration candidates are selected by a model test based on the information amount reference amount. AIC (M1) is the information amount reference amount of the dependency model for the word w₁, and serves as a measure indicating that only the word w₁ should be used when making a determination about the specific topic x. AIC (M2) is the information amount reference amount of the dependency model for the word w₂, and serves as a measure indicating that only the word w₂ should be used when making a determination about the specific topic x. AIC (M3) is the information amount reference amount of the dependency model for both the word w₁ and the word w₂, and serves as a measure indicating that both words should be used when making a determination about the specific topic x. AIC (M0) is the information amount reference amount of the independence model, and serves as a measure indicating that neither the word w₁, nor the word w₂, nor the combination of the word w₁ and the word w₂ should be used when making a determination about the specific topic x.

First, the dictionary registration candidate selection unit 16 compares AIC (M1), AIC (M2), AIC (M3), and AIC (M0) and selects the one with the smallest value. If the selected value is AIC (M0), the dictionary registration candidate selection process ends. If AIC (M1), AIC (M2), or AIC (M3) is selected, the dictionary registration candidate corresponding to that result is selected. However, the following restrictions (1), (2), (3), and (4) apply when selecting a dictionary registration candidate.

(1) When AIC (M1) is the minimum, the word w₁ is selected only if it is contained more often in the correct documents than in the incorrect documents. Specifically, the word w₁ is selected only when the following inequality holds.
(N11 + N12) ÷ (N11 + N12 + N21 + N22) > (N13 + N14) ÷ (N13 + N14 + N23 + N24)

(2) When AIC (M2) is the minimum, the word w₂ is selected only if it is contained more often in the correct documents than in the incorrect documents. Specifically, the word w₂ is selected only when the following inequality holds.
(N11 + N13) ÷ (N11 + N13 + N21 + N23) > (N12 + N14) ÷ (N12 + N14 + N22 + N24)

(3) When AIC (M3) is the minimum, the combination of the word w₁ and the word w₂ is selected only if the combination is contained more often in the correct documents than in the incorrect documents. Specifically, the combination of the word w₁ and the word w₂ is selected only when the following inequality holds.
N11 ÷ (N11 + N21) > (N12 + N13 + N14) ÷ (N12 + N13 + N14 + N22 + N23 + N24)
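
The three inequalities can be checked mechanically. The following is a minimal sketch that only mirrors the formulas as written; the function name `passes_constraint` and the dictionary layout are hypothetical, the counts N11 to N24 are assumed to carry the meaning given to them earlier in the description, and treating a zero denominator as a failed check is an assumption that the description does not address.

```python
def passes_constraint(model, n):
    """Check inequality (1), (2) or (3) for model "M1", "M2" or "M3".

    `n` maps the names "N11" .. "N24" to counts (hypothetical layout).
    """
    if model == "M1":    # inequality (1): word w1
        left_num = n["N11"] + n["N12"]
        left_den = left_num + n["N21"] + n["N22"]
        right_num = n["N13"] + n["N14"]
        right_den = right_num + n["N23"] + n["N24"]
    elif model == "M2":  # inequality (2): word w2
        left_num = n["N11"] + n["N13"]
        left_den = left_num + n["N21"] + n["N23"]
        right_num = n["N12"] + n["N14"]
        right_den = right_num + n["N22"] + n["N24"]
    elif model == "M3":  # inequality (3): combination of w1 and w2
        left_num = n["N11"]
        left_den = left_num + n["N21"]
        right_num = n["N12"] + n["N13"] + n["N14"]
        right_den = right_num + n["N22"] + n["N23"] + n["N24"]
    else:
        raise ValueError("model must be 'M1', 'M2' or 'M3'")
    # Assumption: with no observations on either side, do not select.
    if left_den == 0 or right_den == 0:
        return False
    return left_num / left_den > right_num / right_den


counts = {"N11": 8, "N12": 2, "N13": 1, "N14": 9,
          "N21": 1, "N22": 1, "N23": 3, "N24": 15}
print(passes_constraint("M3", counts))
```

With these illustrative counts the left-hand side of inequality (3) is 8 ÷ 9 and the right-hand side is 12 ÷ 31, so the combination would be accepted.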

Next, the dictionary registration candidate selection unit 16 calculates a score for each selected dictionary registration candidate. The score formulas are as follows.
Score of the word w₁: E (M1) = AIC (M0) − AIC (M1)
Score of the word w₂: E (M2) = AIC (M0) − AIC (M2)
Score of the combination of the word w₁ and the word w₂: E (M3) = AIC (M0) − AIC (M3)
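
As a small worked illustration of these score formulas, the sketch below computes E (M1), E (M2), and E (M3) from four AIC values. The function name `scores` and the numeric values are hypothetical and chosen only for the example. Because a smaller AIC indicates a better-fitting model, a positive score means the dependency model fits better than the independence model M0.

```python
def scores(aic):
    """Return E(Mk) = AIC(M0) - AIC(Mk) for k = 1, 2, 3.

    `aic` maps "M0", "M1", "M2", "M3" to AIC values (hypothetical layout).
    """
    return {m: aic["M0"] - aic[m] for m in ("M1", "M2", "M3")}


# Illustrative values only; they are not taken from the description.
example = {"M0": 120.0, "M1": 95.5, "M2": 130.0, "M3": 80.0}
print(scores(example))  # {'M1': 24.5, 'M2': -10.0, 'M3': 40.0}
```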

Next, the dictionary registration candidate selection unit 16 records each selected dictionary registration candidate together with its score.

Next, the dictionary registration candidate selection unit 16 considers the information amount reference amounts other than those already selected, selects the one with the smallest value, and selects a dictionary registration candidate in the same manner as above. The dictionary registration candidate selection unit 16 repeats this selection process until AIC (M0) is selected.
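
The repeated selection can be sketched as a short loop. This is an illustrative Python sketch under stated assumptions, not the C program of FIG. 5: the name `select_candidates` is hypothetical, `passes` is a caller-supplied stand-in for restrictions (1) to (3), ties between equal AIC values are broken by dictionary order, and a model that fails its restriction is assumed to be skipped while the loop continues.

```python
def select_candidates(aic, passes):
    """Select models in ascending AIC order until M0 is the minimum.

    aic    -- dict mapping "M0", "M1", "M2", "M3" to AIC values
    passes -- callable(model) -> bool, stand-in for restrictions (1)-(3)
    Returns a list of (model, score) pairs in order of selection.
    """
    remaining = dict(aic)
    selected = []
    while remaining:
        best = min(remaining, key=remaining.get)  # smallest AIC not yet chosen
        if best == "M0":
            break  # independence model fits best: stop selecting
        if passes(best):
            selected.append((best, aic["M0"] - aic[best]))  # score E(Mk)
        del remaining[best]  # never pick the same model twice
    return selected


example = {"M0": 120.0, "M1": 95.5, "M2": 130.0, "M3": 80.0}
print(select_candidates(example, lambda m: True))
# [('M3', 40.0), ('M1', 24.5)]
```

In this example M3 is chosen first, then M1; M0 is then the smallest remaining value, so M2 is never considered.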

FIG. 5 shows an example of a program for the dictionary registration candidate selection process according to the present embodiment. The program shown in FIG. 5 is written in the C language.

Next, the dictionary registration candidate selection unit 16 checks whether the recorded dictionary registration candidates include the same candidate more than once. If they do, the dictionary registration candidate selection unit 16 takes the minimum of the scores recorded for that candidate as the candidate's score. In the present embodiment, a larger score value is a better score, so the minimum score is adopted as the more conservative value. This has the effect of preventing correct documents for the specific topic x from being detected excessively in document label determination using the filtering dictionary 30.
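
The duplicate handling amounts to keeping the minimum score per candidate. A minimal sketch follows, assuming (hypothetically) that each recorded candidate is represented as a tuple of one or more words paired with its score; the name `merge_duplicates` is illustrative.

```python
def merge_duplicates(records):
    """Collapse repeated candidates, keeping the minimum score for each.

    records -- iterable of (candidate, score), where candidate is a tuple
               of one or more words (hypothetical representation)
    Returns a dict mapping candidate -> minimum recorded score.
    """
    merged = {}
    for candidate, score in records:
        if candidate not in merged or score < merged[candidate]:
            merged[candidate] = score
    return merged


recorded = [(("soccer",), 24.5), (("world", "cup"), 40.0), (("soccer",), 12.0)]
print(merge_duplicates(recorded))
# {('soccer',): 12.0, ('world', 'cup'): 40.0}
```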

[Step S6: Update of Document Sets]
The input document filter unit 17 deletes the correct documents and incorrect documents that contain a dictionary registration candidate selected by the dictionary registration candidate selection unit 16 from the set of correct documents DOC_M and the set of incorrect documents DOC_N, thereby updating DOC_M and DOC_N. In other words, the input document filter unit 17 deletes the correct documents and incorrect documents containing the selected dictionary registration candidate from the set of correct documents and the set of incorrect documents, and constructs a new set of correct documents and a new set of incorrect documents. The score calculation unit 12 then takes each of the one or more words contained in the new sets constructed by the input document filter unit 17 as a dictionary registration candidate and calculates the information amount reference amount of consecutive words.
This guarantees that, in subsequent processing, new dictionary registration candidates are selected from correct documents and incorrect documents that do not contain any already selected dictionary registration candidate. The same dictionary registration candidate is therefore prevented from being selected more than once.
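
The document set update in step S6 can be sketched as a filter over tokenized documents. The sketch below is illustrative only: it assumes each document is already a list of words (for example, the output of morphological analysis), represents a candidate as a tuple of consecutive words, and uses the hypothetical names `contains` and `filter_documents`.

```python
def contains(tokens, candidate):
    """True if `candidate` (a tuple of words) occurs as consecutive tokens."""
    k = len(candidate)
    return any(tuple(tokens[i:i + k]) == candidate
               for i in range(len(tokens) - k + 1))


def filter_documents(docs, candidates):
    """Return the documents that contain none of the selected candidates."""
    return [d for d in docs if not any(contains(d, c) for c in candidates)]


doc_m = [["world", "cup", "final"], ["cup", "of", "tea"], ["stock", "market"]]
print(filter_documents(doc_m, [("world", "cup")]))
# [['cup', 'of', 'tea'], ['stock', 'market']]
```

The same function would be applied to both DOC_M and DOC_N before the counts are recomputed on the reduced sets.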

[Step S7: Termination Check for Dictionary Registration Candidate Determination]
The dictionary registration candidate selection unit 16 determines whether every word stored in the frequency calculation target word table has been evaluated as a possible dictionary registration candidate. If every word stored in the frequency calculation target word table has been evaluated (step S7, YES), the process proceeds to step S8. If words that have not yet been evaluated remain (step S7, NO), the process returns to step S3.

[Step S8: Dictionary Registration]
The dictionary registration unit 18 registers the dictionary registration candidates and scores selected and recorded by the dictionary registration candidate selection unit 16 in the filtering dictionary 30. The candidates and scores to be registered are the pair of the word w₁ and its score E (M1) (a scored word), the pair of the word w₂ and its score E (M2) (a scored word), and the pair of the combination of the word w₁ and the word w₂ and its score E (M3) (a scored word combination).

As a result of the dictionary registration in step S8, the filtering dictionary 30 stores scored words and scored word combinations in association with a label representing the specific topic x.
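
One possible in-memory layout for such a dictionary is sketched below. The description does not specify a storage format, so the nested-dictionary structure, the keys `"words"` and `"combinations"`, and the name `build_filtering_dictionary` are all assumptions made for illustration.

```python
def build_filtering_dictionary(label, scored):
    """Group scored candidates under a topic label.

    label  -- label representing the topic, e.g. "sports"
    scored -- dict mapping candidate (tuple of words) -> score
    Single words go under "words"; longer tuples go under "combinations".
    """
    entry = {"words": {}, "combinations": {}}
    for candidate, score in scored.items():
        if len(candidate) == 1:
            entry["words"][candidate[0]] = score
        else:
            entry["combinations"][candidate] = score
    return {label: entry}


print(build_filtering_dictionary(
    "sports", {("soccer",): 12.0, ("world", "cup"): 40.0}))
```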

Note that the dictionary registration candidates to be registered in the filtering dictionary 30 may be narrowed down based on their scores. For example, only a predetermined number of the highest-scoring dictionary registration candidates may be registered in the filtering dictionary 30, or only dictionary registration candidates whose scores are good enough to satisfy a predetermined condition may be registered in the filtering dictionary 30.
Note also that, in order to keep the score calculation unit 12 and the dictionary registration candidate selection unit 16 from selecting words in a way that is biased toward a particular document set, the input document filter unit 17 may remove, from the sets of correct documents and incorrect documents, any document that the topic determination device 32 can already determine to be a correct document using words registered in the filtering dictionary 30.
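
The narrowing of candidates by score mentioned above can be sketched as follows. The name `narrow_by_score` and its parameters are hypothetical, and the "predetermined condition" is assumed here to be a simple minimum-score threshold, which the description does not specify.

```python
def narrow_by_score(entries, top_n=None, min_score=None):
    """Keep only the better-scoring candidates (larger score is better).

    entries   -- dict mapping candidate -> score
    top_n     -- if given, keep at most this many highest-scoring entries
    min_score -- if given, keep only entries with score >= min_score
                 (assumed form of the "predetermined condition")
    """
    kept = [(c, s) for c, s in entries.items()
            if min_score is None or s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    if top_n is not None:
        kept = kept[:top_n]
    return dict(kept)


entries = {("a",): 5.0, ("b",): 40.0, ("c", "d"): 24.5}
print(narrow_by_score(entries, top_n=2))
# {('b',): 40.0, ('c', 'd'): 24.5}
```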

According to the embodiment described above, a dictionary consisting only of scored words and a dictionary consisting only of scored word combinations can be mixed in one dictionary. In document label determination using the filtering dictionary 30, storing scored words in the filtering dictionary 30 helps prevent correct documents for the specific topic x from being missed, while storing scored word combinations in the filtering dictionary 30 can be expected to prevent correct documents for the specific topic x from being detected excessively. As a result, the embodiment can contribute to improving the accuracy of document label determination.

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment, and design changes and the like that do not depart from the gist of the present invention are also included.
For example, in the embodiment described above the combination of the word w₁ and the word w₂ is used as a dictionary registration candidate, but the invention is equally applicable to combinations of three or more words as dictionary registration candidates.

A program for realizing each step shown in FIG. 2 may be recorded on a computer-readable recording medium, and the dictionary generation process may be performed by causing a computer system to read and execute the program recorded on the recording medium. The "computer system" referred to here may include an OS and hardware such as peripheral devices.
When a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).
The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a DVD (Digital Versatile Disk), or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes a medium that holds a program for a certain period of time, such as a volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
The program may also be transmitted from a computer system that stores the program in a storage device or the like to another computer system via a transmission medium or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line (communication wire) like a telephone line.
The program may also realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS: 2 ... dictionary generation device, 4 ... document normalization unit, 6 ... morphological analysis unit, 12 ... score calculation unit, 16 ... dictionary registration candidate selection unit, 17 ... input document filter unit, 18 ... dictionary registration unit, 30 ... filtering dictionary, 32 ... topic determination device

Claims

1. A dictionary generation device that generates a filtering dictionary for determining whether a document is related to a specific property, using a set of correct documents related to the property and a set of incorrect documents not related to the property, the device comprising:
a score calculation unit that takes each of one or more words contained in the correct documents or the incorrect documents as a dictionary registration candidate and calculates an information amount reference amount of consecutive words;
a dictionary registration candidate selection unit that determines, based on the information amount reference amount, a word or a combination of consecutive words to be registered in the filtering dictionary;
a dictionary registration unit that registers the selected dictionary registration candidate in the dictionary together with a score based on the corresponding information amount reference amount; and
an input document filter unit that deletes the correct documents and incorrect documents containing the dictionary registration candidate selected by the dictionary registration candidate selection unit from the set of correct documents and the set of incorrect documents, and constructs a new set of correct documents and a new set of incorrect documents,
wherein the score calculation unit takes each of one or more words contained in the new set of correct documents and the new set of incorrect documents constructed by the input document filter unit as a dictionary registration candidate and calculates the information amount reference amount of the consecutive words.

2. The dictionary generation device according to claim 1, wherein the input document filter unit, in order to prevent the score calculation unit and the dictionary registration candidate selection unit from performing word selection biased toward a specific document set, removes documents that a topic determination device can determine to be correct documents using words already registered in the filtering dictionary.

3. The dictionary generation device described above, wherein the dictionary registration candidate selection unit selects a combination of consecutive words as an entry to be registered in the filtering dictionary only when that combination is contained more often in the correct documents than in the incorrect documents.

4. The dictionary generation device according to any one of claims 1 to 3, wherein, when the same dictionary registration candidate is selected a plurality of times, the dictionary registration unit takes the minimum of the scores associated with that candidate as the score of the dictionary registration candidate.

5. The dictionary generation device according to any one of claims 1 to 4, wherein the score calculation unit receives the new set of correct documents and the new set of incorrect documents from the input document filter unit, and the selected dictionary registration candidate is registered in the dictionary together with a score based on the corresponding information amount reference amount.

6. The dictionary generation device according to claim 1, wherein the score calculation unit extracts two or more consecutive words contained in the correct documents and the incorrect documents, counts, for the case where those words appear alone and for the case where two or more of them appear consecutively, the number of occurrences in the correct documents and the incorrect documents, calculates an information amount reference amount for the combination of two or more consecutive words based on the counted numbers, and selects a dictionary registration candidate based on the calculated information amount reference amount.

7. The dictionary generation device according to any one of claims 1 to 3, further comprising a document normalization unit that deletes data other than text data from an input document, whereby the dictionary generation device can handle various input documents.

8. A document label determination system comprising:
the dictionary generation device according to any one of claims 1 to 3;
a filtering dictionary, generated by the dictionary generation device, that stores scored words and scored word combinations in association with a label representing a specific property; and
a topic determination device that determines a label corresponding to an input document using the filtering dictionary.

9. A computer program for generating a dictionary for determining whether a document is related to a specific property, using a set of correct documents related to the property and a set of incorrect documents not related to the property, the computer program causing a computer to execute:
a step of taking each of one or more words contained in the correct documents or the incorrect documents as a dictionary registration candidate and calculating an information amount reference amount of consecutive words;
a step of determining, based on the information amount reference amount, a word or a combination of consecutive words to be registered in the filtering dictionary;
a step of registering the selected dictionary registration candidate in the dictionary together with a score based on the corresponding information amount reference amount;
a step of deleting the correct documents and incorrect documents containing the dictionary registration candidate selected by the dictionary registration candidate selection unit from the set of correct documents and the set of incorrect documents, and constructing a new set of correct documents and a new set of incorrect documents; and
a step of taking each of one or more words contained in the new set of correct documents and the new set of incorrect documents as a dictionary registration candidate and calculating the information amount reference amount of the consecutive words.