JP5056337B2

JP5056337B2 - Information retrieval system

Info

Publication number: JP5056337B2
Application number: JP2007270253A
Authority: JP
Inventors: 守加藤; 光則郡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-10-17
Filing date: 2007-10-17
Publication date: 2012-10-24
Anticipated expiration: 2027-10-17
Also published as: JP2009098952A

Description

The present invention relates to an information retrieval system that analyzes the text contained in a document and classifies the document according to the content of that text.

Methods for automatic document classification have been proposed in the past. For document classification using machine learning, the method disclosed in Patent Document 1, for example, describes regular patterns using predetermined groups of specific terms or regular expressions, and replaces character strings that match these patterns with tokens (labels expressed as lexical items). This reduces the number of features needed to learn how to classify documents with a large vocabulary.

Patent Document 1: JP-T-2003-535407, pages 76-85

The method of Patent Document 1, an example of conventional document classification by machine learning, assumes text in which words are delimited, as in English, and cannot be applied directly to text without word delimiters, as in Japanese. Morphological analysis, which segments such text into words, is a known solution to this problem, but it is computationally expensive and therefore slow.

In addition, when the method of Patent Document 1 is applied to the detection of confidential information, for example when tokenizing personal information, which is one kind of confidential information, personal names are frequently detected in error. The method offers no countermeasure against this, so erroneous token detection lowers classification accuracy.

The present invention has been made to solve the problems described above. One object is to provide an information retrieval system that handles text without word delimiters, such as Japanese, and tokenizes it at high speed by character string matching. Another object is to provide an information retrieval system that improves classification accuracy by reducing erroneous detection of the tokens obtained by character string matching.

To solve the problems described above, the information retrieval system according to the present invention comprises: a matching condition storage unit that stores matching conditions between character strings and keywords in association with feature tokens that identify the matching conditions; a feature token extraction unit that, based on the matching conditions and feature tokens stored in the matching condition storage unit, matches the character strings of learning documents classified in advance by category against the keywords and extracts first feature tokens corresponding to the matched matching conditions in association with the categories, and likewise matches the character strings of a classification target document to be classified into the categories against the keywords and extracts second feature tokens corresponding to the matched matching conditions; a non-feature token extraction unit that extracts, in association with the categories, first non-feature tokens obtained by dividing into individual characters the character strings of the learning documents from which no first feature token was extracted, and extracts second non-feature tokens obtained by dividing into individual characters the character strings of the classification target document from which no second feature token was extracted; a learning unit that calculates, in association with the categories, the appearance frequency of first token strings composed of the first feature tokens and the first non-feature tokens as a learning frequency; and a classification unit that calculates, for each category, a classification probability indicating the similarity between the appearance frequency of second token strings composed of the second feature tokens and the second non-feature tokens and the learning frequency calculated by the learning unit, and classifies the classification target document into each category whose classification probability exceeds a predetermined threshold.

According to the present invention, matching conditions that define the feature tokens to be extracted are used to extract a sequence of feature tokens and non-feature tokens from an input document, and learning or classification is performed using the prioritization of matching conditions and the chain probabilities of tokens. Consequently, even when a document containing text without word delimiters is input, tokenization can be performed by character string matching, which is faster than morphological analysis, so high-speed processing is possible. Furthermore, erroneous detection of feature tokens is prevented and classification accuracy is improved.

The following description takes confidential information retrieval as an example embodiment, but the present invention is not limited to confidential information retrieval and can be used widely for document classification in general. Likewise, the description takes the retrieval of Japanese documents as an example, but the invention is not limited to Japanese and is applicable to any character code.

Embodiment 1.
FIG. 1 is a configuration diagram showing an example of the information retrieval system according to Embodiment 1.
This information retrieval system comprises a preprocessing unit 100, a learning unit 200, a classification unit 300, and a matching condition storage unit 400 that stores the matching conditions input to the preprocessing unit 100.

The preprocessing unit 100 further comprises a text extraction unit 101, a feature token extraction unit 102, and a non-feature token extraction unit 103. The learning unit 200 comprises a learning frequency calculation unit 201, which calculates token frequencies for learning, and a learning frequency storage unit 202, which accumulates the calculated token frequencies for each classification category and stores them as learning frequencies. The classification unit 300 comprises a classification frequency calculation unit 301, which calculates token frequencies for classification; a classification probability calculation unit 302, which calculates the classification probability of an input document from these token frequencies and the learning frequencies stored in the learning frequency storage unit 202; and a category determination unit 303, which makes the final determination of the input document's category.

The learning documents 501 are a set of documents classified in advance into a plurality of categories. Concretely, the categories can be two classes such as ("non-confidential document", "confidential document"). Alternatively, they can be three or more classes reflecting the confidentiality level of the document, such as ("non-confidential document", "confidentiality level 1 document", "confidentiality level 2 document", ...).

The classification target documents 502, on the other hand, are a set of documents whose categories are unknown and which are to be classified. The information retrieval system of Embodiment 1 determines the category to which each classification target document 502 belongs.

The matching condition storage unit 400 stores the matching conditions input to the preprocessing unit 100. The matching conditions are set before learning and classification and are divided into built-in matching conditions 4001 and user-defined matching conditions 4002. The built-in matching conditions 4001 are incorporated when the information retrieval system is shipped. Providing them as a basic set lets users start using the system immediately. The user-defined matching conditions 4002 let each user customize the matching conditions, for example by adding terms specific to that user.

The matching conditions are represented as a set of pairs, each consisting of a keyword and a matching condition ID that identifies the matching condition. A keyword here means an expression that, for each of a plurality of term classes representing superordinate concepts of words and character strings, specifies the words and character strings belonging to that term class. A keyword may be written as a list of fixed terms or as a regular expression.
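As a rough illustration of this representation, a matching condition could be held as an ID paired with a keyword pattern. The sketch below is an assumption made for this example only: the class name `MatchingCondition`, the helper `matching_tokens`, and the sample patterns are hypothetical and are not taken from FIG. 5.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchingCondition:
    condition_id: int   # matching condition ID
    term_class: str     # term class the keyword stands for
    pattern: str        # keyword: fixed terms joined with "|" or a regular expression

    def feature_token(self) -> str:
        # Token naming follows the "TOKEN_1" style mentioned in the description.
        return f"TOKEN_{self.condition_id}"


# Hypothetical sample conditions (illustrative values only).
CONDITIONS = [
    MatchingCondition(1, "surname", "鈴木|佐藤|東"),
    MatchingCondition(12, "phone number", r"\(0\d\)\d{4}-\d{4}"),
]


def matching_tokens(text, conditions=CONDITIONS):
    """Return the feature tokens of every condition whose keyword occurs in text."""
    return [c.feature_token() for c in conditions if re.search(c.pattern, text)]
```

Keeping the ID separate from the pattern means terms can be added to a pattern later without changing the token that downstream learning sees.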

Next, the operation of the information retrieval system according to Embodiment 1 of the present invention is described in detail with reference to FIGS. 2 to 6 as appropriate.

First, the operation of the learning stage is described with reference to FIG. 2.
FIG. 2 is a flowchart showing the operation of the information retrieval system in Embodiment 1.
In step S11, learning documents 501 categorized in advance are input to the preprocessing unit 100. In step S12, the preprocessing unit 100 extracts a token string from the learning document 501. In step S13, the learning frequency calculation unit 201 analyzes the extracted token string and calculates the appearance frequency of sequences of N consecutive tokens. The number of consecutive tokens may also be any value from one to N. In step S14, the learning frequency storage unit 202 stores the learned token frequencies for each category as learning frequencies. In step S15, the learning unit 200 determines whether all of the learning documents 501 have been learned. If the learning documents 501 contain multiple documents and some remain to be learned, the process takes the No branch and repeats the operations from step S11. If all documents have been processed, the process takes the Yes branch and the learning stage ends.
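The frequency counting of steps S13 and S14 can be sketched as follows. This is a minimal illustration that assumes tokenization (steps S11 and S12) has already produced a list of tokens per document; the function names `ngram_counts` and `learn` and the in-memory dictionary are assumptions, not the patent's storage format.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, max_n):
    """Count every run of 1..max_n consecutive tokens in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def learn(labelled_docs, max_n=3):
    """Accumulate sequence frequencies per category.

    labelled_docs: iterable of (category, token_list) pairs.
    Returns a dict mapping category -> Counter of token sequences.
    """
    learned = defaultdict(Counter)
    for category, tokens in labelled_docs:                      # loop closed by S15
        learned[category].update(ngram_counts(tokens, max_n))   # S13 and S14
    return learned
```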

Next, the operation of the classification stage is described with reference to FIG. 3.
FIG. 3 is a flowchart showing the operation of the classification stage in Embodiment 1.
In step S21, a classification target document 502 is input to the preprocessing unit 100. In step S22, the preprocessing unit 100 extracts a token string from the classification target document 502. In step S23, the classification frequency calculation unit 301 analyzes the extracted token string and calculates the appearance frequency of sequences of N consecutive tokens. The number of consecutive tokens may also be any value from one to N. In step S24, the classification probability calculation unit 302 calculates, based on the learning results, the probability that the classification target document 502 belongs to each category. In step S25, the category determination unit 303 determines, based on the probabilities calculated in step S24, the category into which the classification target document 502 is classified. Finally, in step S26, the classification unit 300 outputs the category determined in step S25 as the classification result.
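The description does not fix a formula for the classification probability of step S24. Purely as an illustration, the sketch below swaps in a simple multinomial naive Bayes score with add-one smoothing and equal category priors, then applies a threshold as in step S25. The function names, the default threshold of 0.5, and the scoring method itself are assumptions.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every run of 1..max_n consecutive tokens in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def classify(tokens, learned, max_n=2, threshold=0.5):
    """Score a token list against learned per-category frequencies.

    learned: non-empty dict mapping category -> Counter of token sequences.
    Returns (probabilities per category, categories whose probability exceeds threshold).
    """
    doc = ngram_counts(tokens, max_n)                      # S23
    vocab = set().union(*learned.values()) | set(doc)
    log_scores = {}
    for category, freq in learned.items():                 # S24
        total = sum(freq.values()) + len(vocab)            # add-one smoothing
        log_scores[category] = sum(
            count * math.log((freq[gram] + 1) / total)
            for gram, count in doc.items()
        )
    # Normalise the log scores so the probabilities sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    probs = {c: w / norm for c, w in weights.items()}
    return probs, [c for c, p in probs.items() if p > threshold]   # S25
```

Any scorer that compares the document's sequence frequencies with the learned frequencies could take the place of the smoothing step here.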

The learning and classification operations described above can also be implemented using the text classification software presented as CRM114 in Non-Patent Document 1.

Non-Patent Document 1: William S. Yerazunis, "Sparse Binary Polynomial Hashing and the CRM114 Discriminator", MIT Spam Conference 2003, January 17, 2003

The learning operation may be executed in a single batch at system initialization. In addition, if a classification error occurs during classification, the learning data may be updated by learning the misclassified document again as a learning document.

Next, the operation of the preprocessing unit 100 is described in more detail with reference to FIG. 4.
FIG. 4 is a flowchart showing the operation of the preprocessing unit 100 in Embodiment 1.
In step S31, either learning documents 501 categorized in advance or classification target documents 502 are input to the preprocessing unit 100 as input documents. In step S32, the text extraction unit 101 extracts the text, that is, the content written in natural language, from the input document.

The text extraction unit 101 extracts text from documents of arbitrary format. Any document that contains text qualifies, for example documents in the various formats produced by commercial document editing software, e-mail, and documents used on the Web such as HTML (HyperText Markup Language). Commercial text extraction software can be used as the text extraction unit 101. The output of the text extraction unit 101 may also be the document's character string with spaces, tabs, line breaks, and the like removed. This reduces missed detections caused by a space, tab, or line break falling in the middle of a term.
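The optional whitespace removal can be illustrated in a couple of lines. The function name `strip_whitespace` is hypothetical; the sketch relies on Python's `\s` class, which for `str` patterns also covers the full-width space U+3000.

```python
import re


def strip_whitespace(text):
    """Remove spaces, tabs, line breaks and other Unicode whitespace."""
    return re.sub(r"\s+", "", text)
```

With this step applied, a term such as "社外秘" that was broken across a line in the source document is restored to a contiguous string before matching.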

Next, in step S33, the feature token extraction unit 102 extracts feature tokens from the text extracted in step S32. A feature token here is a character string representing the term class set for each matching condition stored in the matching condition storage unit 400. The feature token may be the term class name itself or the matching condition ID itself. To avoid confusion with the non-feature tokens that the non-feature token extraction unit 103 (described later) extracts as character strings, however, a special string can also be used, such as "TOKEN_1" in half-width alphanumeric characters for matching condition ID = 1.

The feature token extraction unit 102 refers to the matching conditions stored in the matching condition storage unit 400, matches the text extracted in step S32 against the keywords set in the matching conditions, and replaces each character string in the text that matches a keyword with the feature token corresponding to the matched matching condition ID. When the keywords in the matching conditions are given as sets of fixed terms, the matching and replacement proceed as follows. For every matching condition, the feature token extraction unit 102 compares every term belonging to that condition with the text as character strings and, on a match, replaces the matched string with the feature token. This process is performed while shifting the matching position in the text one character at a time.
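The position-by-position matching against fixed terms can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `extract_feature_tokens`, the `(ID, terms)` list layout, the rule that the first listed condition wins at a given position, and the choice to pass unmatched characters through one per entry are all assumptions made for brevity.

```python
def extract_feature_tokens(text, conditions):
    """Replace matched terms with feature tokens, scanning one position at a time.

    conditions: list of (condition_id, [fixed terms]) pairs, tried in list order.
    Returns a list whose entries are feature tokens or single unmatched characters.
    """
    out = []
    pos = 0
    while pos < len(text):
        for condition_id, terms in conditions:
            # Compare every term of this condition with the text at the current position.
            hit = next((t for t in terms if t and text.startswith(t, pos)), None)
            if hit is not None:
                out.append(f"TOKEN_{condition_id}")
                pos += len(hit)          # skip past the replaced string
                break
        else:
            out.append(text[pos])        # no term matched here; move on one character
            pos += 1
    return out
```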

When the keywords set in the matching conditions are given as regular expressions, the processing can be implemented with software that provides regular expression replacement, for example sed, a text processing program found on common operating systems (OS), the programming language Perl, or other regular expression libraries.
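For comparison with the tools named above, the same kind of replacement can be sketched with Python's standard `re` module. The two patterns, the ID assigned to the e-mail rule, and the function name `replace_with_tokens` are assumptions for illustration. The `re.ASCII` flag keeps `\w` and `\d` from matching Japanese characters.

```python
import re

# (matching condition ID, regular expression keyword); illustrative patterns only.
RULES = [
    (12, r"\(0\d{1,4}\)\d{1,4}-\d{4}"),          # phone number such as (03)1111-2222
    (11, r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),       # e-mail address (ID 11 is hypothetical)
]


def replace_with_tokens(text, rules=RULES):
    """Replace every match of each rule with a space-delimited feature token."""
    for condition_id, pattern in rules:
        text = re.sub(pattern, f" TOKEN_{condition_id} ", text, flags=re.ASCII)
    return text
```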

Replacing terms in the text with feature tokens in this way reduces the vocabulary size used as features for document classification, and so reduces the amount of training required by the learning unit 200 in the subsequent stage. For example, personal names used as features for personal information require several thousand terms even when limited to frequently occurring names, and learning these names requires far more than several thousand words of training data, which is difficult for users to collect. Reducing the vocabulary size through feature token extraction as described above makes accurate classification achievable with less training data.

Next, in step S34, the non-feature token extraction unit 103 extracts non-feature tokens from the character strings in the text that were not extracted as feature tokens. A non-feature token here is a single character. That is, the strings other than feature tokens are taken out one character at a time, and each character becomes a non-feature token.

More preferably, taking Japanese documents as an example, Japanese characters are extracted one character at a time, while alphanumeric characters and symbols are extracted in units delimited by changes of character type, such as between Japanese characters and alphanumerics. If feature tokens are written in alphanumeric characters and symbols, as in the "TOKEN_1" example above, the feature tokens already extracted by the feature token extraction unit 102 can then be distinguished easily from other strings. The step of checking whether a string is a feature token, which would otherwise be needed to keep the non-feature token extraction unit 103 from splitting feature tokens, can therefore be omitted. Non-feature tokens can of course be extracted in the same way from documents in languages other than Japanese.
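One way to realize this character-type split is a single regular expression, sketched below. The assumptions here are that "alphanumerics and symbols" means printable half-width ASCII, that letters, digits, and symbols count as one character type, and that full-width alphanumerics are treated like Japanese characters. The name `split_tokens` is hypothetical.

```python
import re

# A run of printable half-width ASCII is kept whole; any other non-space
# character becomes a token by itself.
_TOKEN_RE = re.compile(r"[!-~]+|\S")


def split_tokens(text):
    """Split text into ASCII runs and single non-ASCII characters."""
    return _TOKEN_RE.findall(text)
```

Because "TOKEN_1" is itself an ASCII run, it survives this step intact without any lookup against the list of feature tokens.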

Next, in step S35, the preprocessing unit 100 outputs a token string consisting of the feature tokens and non-feature tokens extracted above. As a data output method, the token string can be expressed using, for example, space or line-break delimiters, and a text file containing the token string can be passed to the learning unit 200 or the classification unit 300 in the subsequent stage. Alternatively, the token string may be stored in memory as an array of character strings and passed to the learning unit 200 or the classification unit 300.

Through the preprocessing described above, a token string (a sequence of feature tokens and non-feature tokens) is extracted from the input document. Thus, even when a document containing text without word delimiters is input, tokenizing by character string matching, which is faster than morphological analysis, enables high-speed processing.

The matching conditions are now described in more detail with reference to the example shown in FIG. 5. The following description uses keywords expressed as regular expressions (regular expression keywords) as an example, but keywords consisting of fixed terms can be used in the same way.

FIG. 5 is a diagram showing an example of matching conditions in Embodiment 1.
For each term class, a matching condition pairs a regular expression keyword with a matching condition ID, and multiple such pairs can be provided. As shown in FIG. 5, for term classes such as personal names (surnames) and prefecture names, a regular expression keyword can be created that lists the terms (proper nouns) belonging to the class. For term classes such as e-mail addresses and telephone numbers, a regular expression keyword can be created from the expression pattern characteristic of each.

When confidential information is the target of classification, personal names, addresses, telephone numbers, e-mail addresses, and the like are characteristic term classes for personal information, which is one kind of confidential information. Terms that often appear in name lists are also effective for detecting personal information files. To detect confidential information in general, term classes can be defined for terms indicating a confidentiality level such as "社外秘" (company confidential), for document types that often contain confidential information (such as specifications), and for business partner names. FIG. 5 shows only some examples of term classes. The term classes and the regular expression keywords belonging to them are not limited to these.

The matching condition IDs of the built-in matching conditions 4001 and the user-defined matching conditions 4002 in FIG. 5 may overlap. That is, to add a term to a term class already defined in the built-in matching conditions 4001, the user specifies the same matching condition ID in the user-defined matching conditions 4002. The example in FIG. 5 shows personal names chosen by the user being added under matching condition ID = 1.

Because the same matching condition ID is used, the term class does not change even when new terms are added during system operation, so the learning data already acquired can be used as is. Of course, the learning documents 501 can also be learned again so that the stored learning results are updated to include the differences arising from the added terms.

The user-defined matching conditions 4002 can also assign new matching condition IDs. In the example of FIG. 5, the term class "document type" uses matching condition ID = 100, an ID that does not exist in the built-in matching conditions 4001. This lets the user define a new term class and add terms to it.
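Both cases, adding terms under an existing ID and introducing a new ID, can be sketched as a simple merge keyed by matching condition ID. The function name `merge_conditions`, the dictionary layout, and the sample terms are assumptions for illustration and do not reproduce the contents of FIG. 5.

```python
def merge_conditions(builtin, user_defined):
    """Merge user-defined terms into built-in ones, keyed by matching condition ID.

    Both arguments map condition ID -> list of terms. A shared ID extends the
    existing term class; an unseen ID creates a new term class.
    The inputs are left unmodified.
    """
    merged = {cid: list(terms) for cid, terms in builtin.items()}
    for cid, terms in user_defined.items():
        bucket = merged.setdefault(cid, [])
        for term in terms:
            if term not in bucket:       # skip terms that are already registered
                bucket.append(term)
    return merged
```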

Priorities can also be assigned to the matching conditions. In that case, matching against character strings in the text is performed in descending order of priority. Priorities can be assigned, for example, by managing the priority separately, with each matching condition given as a tuple of (ID, regular expression keyword, priority), or by the order in which the matching conditions are written, with conditions written earlier given higher priority.

In a more preferable form, the priority can be determined from the magnitude of the matching condition ID. In this case, in the example of FIG. 5, matching condition ID = 1 has the lowest priority and matching condition ID = 101 has the highest. For example, matching against the string "東京都" (Tokyo) in the text hits both "東京都" under ID = 10 and "東" (a surname) under ID = 1, but because ID = 10 has the higher priority, the string "東京都" is converted into the feature token TOKEN_10. When detecting personal information by personal names, erroneous detection of single-character names such as "東" lowers classification accuracy. Using the priority of matching conditions reduces such erroneous detection when the kanji used in personal names also appear in place names, company names, and the like.
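The "larger ID wins" rule can be sketched as a lookup that collects every condition matching at a given position and keeps the one with the highest ID. The function name `best_match_at` and the dictionary layout are assumptions for this example.

```python
def best_match_at(text, pos, conditions):
    """Return (condition_id, term) of the highest-ID condition matching at pos.

    conditions: dict mapping condition ID -> list of fixed terms.
    Returns None when no term matches at that position.
    """
    hits = [
        (cid, term)
        for cid, terms in conditions.items()
        for term in terms
        if term and text.startswith(term, pos)
    ]
    # Priority is taken from the magnitude of the matching condition ID.
    return max(hits, key=lambda hit: hit[0], default=None)
```

In this sketch "東京都" resolves to ID 10, while the leading "東" of "東日本橋" still resolves to ID 1, which is the residual false detection discussed next.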

FIG. 6 shows a concrete example in which such erroneous detection occurs.
FIG. 6 is a diagram showing an example of a token string generated by the preprocessing unit 100 in Embodiment 1 using the matching conditions of FIG. 5.
The tokenized string is displayed with one token per line. "鈴木" (Suzuki) is converted to the feature token TOKEN_1, "東京都" (Tokyo) to TOKEN_10, the "東" in "東日本橋" (Higashi-Nihonbashi) to TOKEN_1, and "(03)1111-2222" to TOKEN_12. The remaining characters are Japanese in this example, so each character becomes one non-feature token. In the example of FIG. 6, converting the "東" in "東日本橋" to the feature token TOKEN_1 is an erroneous detection. One way to handle such a case is to define a new term class "place name" that includes "東日本橋" as a term, but this cannot prevent every erroneous detection.

In this embodiment, in addition to the prioritization of matching conditions described above, false detections are further suppressed, and their effect on the category classification result reduced, by associating each feature token with the feature tokens or non-feature tokens before and after it. For example, in FIG. 6, for the sequence of three tokens "TOKEN_1, 日, 本", the feature token TOKEN_1 is more likely to be part of the place name "東日本" (East Japan) than a personal name. As described earlier, the learning means 200 cuts sequences of 1 to N consecutive tokens out of the token string and calculates the appearance frequency of each sequence, thereby learning the chain probability of tokens such as these three. Learning and classifying with these token chain probabilities prevents false detection of feature tokens and improves classification accuracy.
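As a rough illustration of cutting out sequences of 1 to N consecutive tokens and counting their appearance frequency, the following sketch uses a plain counter. The function name `count_sequences` and the use of tuples as keys are assumptions; the description does not specify a data structure.

```python
from collections import Counter

def count_sequences(tokens, max_n):
    """Count every run of 1..max_n consecutive tokens in a token string."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Example: the FIG. 6 fragment for "東日本橋" with N = 3.
counts = count_sequences(["TOKEN_1", "日", "本", "橋"], 3)
```

Here the key `("TOKEN_1", "日", "本")` records the three-token chain discussed above; frequencies accumulated per category over many learning documents would then serve as the basis for chain probabilities.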

Furthermore, in this embodiment, sequences of 1 to N tokens that contain no feature token are not used for learning or classification, which reduces both the time required for learning and classification and the storage capacity required for the learning data. When a sequence of 1 to N tokens contains no feature token, that is, when it consists entirely of non-feature tokens, the sequence is merely a run of characters that does not contribute to learning or classification. In particular, when the learning documents 501 contain relatively few documents, each such sequence pattern appears infrequently and is unlikely to be significant for classification, so omitting it from learning or classification does not change the result much. Concretely, the learning means 200 and the classification means 300 may ignore sequences of 1 to N tokens that contain no feature token, or the preprocessing means 100 may output only sequences of 1 to N tokens that contain a feature token.
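The filtering described above might look like the following sketch. It assumes that feature tokens are written as `TOKEN_<id>`, as in FIG. 6, and that every non-feature token is a single character; both function names are hypothetical.

```python
def is_feature_token(token):
    # Assumption: feature tokens are written "TOKEN_<id>", as in FIG. 6.
    return token.startswith("TOKEN_")

def feature_sequences(tokens, max_n):
    """Yield runs of 1..max_n tokens that contain at least one feature token."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            seq = tuple(tokens[i:i + n])
            if any(is_feature_token(t) for t in seq):
                yield seq
```

For the token string TOKEN_1, 日, 本, 橋 with N = 3, only three of the nine possible sequences survive, which suggests how the learning data can shrink when most of the text consists of non-feature characters.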

As described above, according to Embodiment 1, matching conditions that define the feature tokens to be extracted are used to extract a sequence of feature tokens and non-feature tokens from an input document, and learning and classification make use of the prioritization of matching conditions and the chain probability of tokens. Consequently, even when a document containing text without word delimiters is input, tokenization is performed by character string matching, which is faster than morphological analysis, so high-speed processing is possible. In addition, false detection of feature tokens is prevented and classification accuracy is improved.

Embodiment 2.
In Embodiment 1 above, feature token extraction is realized by a character string replacement function. Next, an embodiment is described in which feature token extraction is realized by a regular expression matching function.

Known character string matching methods that use a regular expression as the matching condition include methods based on an NFA (Nondeterministic Finite Automaton) and methods based on a DFA (Deterministic Finite Automaton). With a DFA, the regular expression is compiled into a state transition table, which is then applied to the input character string to perform matching; this is known to be faster than matching with an NFA. The present invention can be implemented with either method, but the following description uses a DFA.

FIG. 7 is a diagram showing an example of the configuration related to the feature token extraction means 102 when a DFA is used in Embodiment 2.
Embodiment 2 includes state transition table generation means 104, which generates a state transition table 105 from the matching conditions stored in the matching condition storage means 400. The feature token extraction means 102 consists of matching means 1021, which matches the input character string 1020 by referring to the state transition table 105, and replacement means 1022, which replaces character strings according to the matching result.

In FIG. 7, the state transition table generation means 104 analyzes all character strings contained in the matching conditions and generates the state transition table 105, which expresses as state transitions the analysis process up to the point where the DFA accepts a character string. Methods for generating a state transition table are generally well known, and the state transition table generation means 104 may use any such existing method.

The matching means 1021 receives the input character string 1020 (the text extracted by the text extraction means 101) one character at a time, matches it using the state transition table 105, and determines whether a matching condition is satisfied. When a condition is satisfied, it outputs the end position of the matched character string (a position measured from the beginning of the input character string 1020, called the hit position here) and the ID of the matched matching condition. The matching means 1021 can also calculate the start position of the matched character string (likewise measured from the beginning of the input character string 1020), in which case the replacement means 1022 performs the replacement with a feature token based on the start position and the end position (hit position).
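The description leaves the table-generation method open ("a generally well-known method"). As one possible stand-in, the sketch below uses an Aho-Corasick automaton over literal keywords, which scans the input one character at a time and reports only the hit position and the matching condition ID. It does not handle general regular expressions, and the names `build` and `scan` are hypothetical.

```python
from collections import deque

def build(patterns):
    """Build goto/fail/output tables from {keyword: condition ID} (Aho-Corasick)."""
    goto, fail, out = [{}], [0], [[]]
    for word, cid in patterns.items():
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(cid)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit IDs of shorter suffix matches
    return goto, fail, out

def scan(text, automaton):
    """Return (hit position, condition ID) pairs; hit position = index after the match."""
    goto, fail, out = automaton
    s, hits = 0, []
    for pos, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((pos + 1, cid) for cid in out[s])
    return hits
```

Note that the scan never computes a start position, so a consumer of these hits needs some other way to learn how many characters each match covered.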

Alternatively, the matching means 1021 may calculate only the end position (hit position) and not the start position. This removes the need for processing such as tracing the state transition table 105 backward to calculate the start position or tracking a start position in each state of the state transition table 105, so matching becomes faster.

In this case, the replacement means 1022 operates as follows.
Matching conditions include term classes whose terms are proper nouns, such as personal names, place names, and company names, and term classes whose regular expression keywords match variable-length terms, such as e-mail addresses. For matching conditions that are lists of proper nouns, the matching condition storage means 400 can split the conditions by term length so that each matching condition ID corresponds to exactly one length. The replacement means 1022 can then use the matching condition ID reported by the matching means 1021 to replace a character string of the corresponding length with the corresponding feature token. For term classes with variable-length terms, no replacement is made; instead, the feature token is inserted immediately after the end position (hit position).

In this way, character strings can be replaced even when the start position is unknown, and the replacement process becomes faster.

FIG. 8 is a diagram showing an example of matching conditions split by term length in Embodiment 2.
The matching condition ID is composed of a term class field, which identifies the term class, and a character count field, which identifies the number of characters. In FIG. 8, ID = 101, 102, and 103 each have a term class field of 1 and character counts of 1, 2, and 3, respectively. The replacement means 1022 generates the feature token from the term class field (TOKEN_1 in this case), removes from the input character string the number of characters given by the character count field, and inserts the feature token. For ID = 1100, the term class field is 11 and the character count field is 0. A character count field of 0 indicates variable length; in this case the replacement means 1022 inserts the feature token (TOKEN_11 in this case) without deleting the matched character string from the input character string. In the example of FIG. 8, the ones and tens digits of the ID form the character count field and the higher digits form the term class field, but the field assignment is not limited to this.
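The ID layout of FIG. 8 and the end-position-only replacement can be sketched as below. This assumes the two-digit character count field of the example, a hit position expressed as the index just past the match, and a token list representation; `split_condition_id` and `apply_hit` are hypothetical names.

```python
def split_condition_id(condition_id):
    """Split an ID into (term class, character count); the low two digits are the count."""
    return divmod(condition_id, 100)

def apply_hit(chars, hit_end, condition_id):
    """Rewrite the token list `chars` for a match that ends just before index `hit_end`."""
    term_class, length = split_condition_id(condition_id)
    token = f"TOKEN_{term_class}"
    if length == 0:
        # Variable-length term: keep the matched text, insert the token after it.
        return chars[:hit_end] + [token] + chars[hit_end:]
    # Fixed-length term: drop the `length` characters that precede the hit position.
    return chars[:hit_end - length] + [token] + chars[hit_end:]
```

Because the length is recovered from the ID alone, the matcher never has to report where the match began, which is the saving this embodiment aims for.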

Embedding the character count in part of the matching condition ID in this way avoids the complexity of managing character counts separately.

When the matching condition ID carries the character count as described above, it is laborious for the user to create matching conditions while keeping track of character counts, so the matching conditions can also be constructed automatically.
FIG. 9 is a diagram showing an example of a configuration that constructs matching conditions automatically in Embodiment 2.
In the example of FIG. 9, matching condition synthesis means 106 is added. The matching condition synthesis means 106 takes as input matching conditions that are not split by character count, such as those in FIG. 5, analyzes the regular expression keywords, and generates matching conditions split by character count. When a regular expression keyword is a list of fixed-length terms, such as personal names, it generates one regular expression keyword per character count and, from the character count of each keyword and the original matching condition ID, generates a new matching condition ID that has a term class field and a character count field. When a regular expression keyword is variable-length, it sets the character count field of the matching condition ID to 0.
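A simplified sketch of this synthesis step follows. It assumes that a fixed-length keyword is a plain `|`-separated list of literal terms and that any regular expression metacharacter marks a keyword as variable-length; a real analyzer would need a more careful test. The two-digit character count field follows FIG. 8, so terms are assumed to be shorter than 100 characters, and the name `synthesize` is hypothetical.

```python
from collections import defaultdict

# Characters treated as a sign that the keyword is a true regular expression.
META = set(".^$*+?{}[]()\\")

def synthesize(conditions):
    """Map [(term class ID, keyword)] to [(new ID, keyword)] split by term length."""
    result = []
    for term_class, keyword in conditions:
        if any(ch in META for ch in keyword):
            # Variable length: character count field is 0.
            result.append((term_class * 100, keyword))
            continue
        by_length = defaultdict(list)
        for term in keyword.split("|"):
            by_length[len(term)].append(term)
        for length in sorted(by_length):
            result.append((term_class * 100 + length, "|".join(by_length[length])))
    return result
```

For instance, a personal-name condition with ID 1 containing one-, two-, and three-character names would be split into IDs 101, 102, and 103, matching the layout shown for FIG. 8.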

Automatically generating matching conditions that carry character counts in this way spares the user the effort of creating matching conditions with character counts in mind.

Embodiment 3.
Embodiment 2 above makes replacement with feature tokens possible even when the matching means 1021 outputs only the position, within the input text, of a character string that satisfies a matching condition. Next, an embodiment is described in which the positional relationship between characteristic terms is converted into tokens.

In detecting personal information, the detection targets often include documents, such as name lists, whose structure places personal names alongside addresses, telephone numbers, e-mail addresses, and the like. Extracting and learning the positional relationships between such characteristic terms can raise classification accuracy.

The preprocessing means 100 in Embodiment 3 extracts and stores the hit positions, within the input document, of the characteristic terms specified by matching condition IDs. When a sequence of two or more characteristic terms matching a predetermined rule appears, it calculates the distances between those features from the stored hit positions and generates an inter-feature distance token that combines the rule ID assigned to the rule with the distances. The learning means 200 and the classification means 300 learn this inter-feature distance token in the same way as other tokens, which allows documents with a name-list-like structure to be learned and classified efficiently.

FIG. 10 is a configuration diagram showing an example of the information retrieval system in Embodiment 3.
In FIG. 10, inter-feature distance token generation means 107 is added to the preprocessing means 100, and rules 600 are also added.

FIG. 11 is a diagram showing an example of the rules 600 in Embodiment 3.
Each rule consists of a pair of a rule ID and a matching condition sequence. ID = 1 in FIG. 11 is a rule for detecting matching condition IDs that appear in the order 1, 11, that is, a personal name followed by an e-mail address. When this is detected, the inter-feature distance token generation means 107 generates an inter-feature distance token that identifies rule ID = 1 and includes the distance between the detected terms. For example, if the distance is 20, it generates the token RULE_1_20.

A matching condition sequence specifies the order of two or more matching conditions; in the example of ID = 3, a sequence of three matching conditions is written. In this case, if the distance between matching condition IDs 1 and 10 is 5 and the distance between 10 and 12 is 6, the token can be written as, for example, RULE_3_5_6. Sequences of four or more are handled in the same way.

An upper limit can also be placed on the distance. In a name list or similar document, characteristic terms that are more than a certain distance apart are usually unrelated, whereas related characteristic terms are likely to be structurally adjacent. "Structurally" here refers to the structure of the document: after text extraction the terms are not necessarily adjacent as character strings, but they are likely to lie within a certain distance of each other. Therefore, ignoring cases where the distance exceeds the limit has little effect.

The distance also need not be exact; distances may be divided into ranges and labeled. For example, a distance of 5 or less may be labeled A, 6 to 10 labeled B, 11 to 20 labeled C, and anything larger excluded. If the detected distance is 6, the token is written as RULE_1_B. Grouping distances into ranges in this way allows efficient learning from relatively few learning documents.
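The generation of inter-feature distance tokens, with and without range labels, can be sketched as follows. Several points are assumptions made for illustration: the `RULES` table stands in for FIG. 11, a rule is taken to match only consecutive hits, and the distance is taken to be the difference between hit positions, which the description does not define precisely.

```python
# Hypothetical rules: rule ID -> matching condition sequence (stand-in for FIG. 11).
RULES = {1: (1, 11), 3: (1, 10, 12)}

def bucket(distance):
    """Map a distance to the example range labels; None means out of scope."""
    if distance <= 5:
        return "A"
    if distance <= 10:
        return "B"
    if distance <= 20:
        return "C"
    return None

def distance_tokens(hits, use_buckets=False):
    """hits: list of (condition ID, hit position) in document order."""
    tokens = []
    for rule_id, sequence in RULES.items():
        n = len(sequence)
        for i in range(len(hits) - n + 1):
            window = hits[i:i + n]
            if tuple(cid for cid, _ in window) != sequence:
                continue
            gaps = [b[1] - a[1] for a, b in zip(window, window[1:])]
            labels = [bucket(g) if use_buckets else str(g) for g in gaps]
            if None not in labels:   # drop matches beyond the upper limit
                tokens.append("_".join([f"RULE_{rule_id}"] + labels))
    return tokens
```

With exact distances, a personal name at position 0 and an e-mail address at position 20 give RULE_1_20; with range labels, a gap of 6 gives RULE_1_B and a gap above 20 produces no token.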

When the matching conditions stored in the matching condition storage means 400 consist of built-in matching conditions 4001 and user-defined matching conditions 4002, the rules 600 can correspondingly consist of built-in rules and user-defined rules, which makes it possible to create rules that include user-defined term classes.

As described in Embodiment 1, the present invention learns or classifies sequences of 1 to N tokens that contain a feature token, but learning such sequences is meaningless for inter-feature distance tokens. One way to avoid this is for the learning means 200 and the classification means 300, upon detecting an inter-feature distance token, to learn and classify it on its own (N = 1) rather than as part of a sequence.

Alternatively, when the preprocessing means 100 outputs an inter-feature distance token, it can insert N-1 dummy tokens before and after it, so that learning and classification handle sequences of N tokens unchanged. For example, with N = 3 the token string becomes DUMMY, DUMMY, RULE_X_X, DUMMY, DUMMY, and the runs of N consecutive tokens are:
(DUMMY, DUMMY, RULE_X_X)
(DUMMY, RULE_X_X, DUMMY)
(RULE_X_X, DUMMY, DUMMY)
This avoids learning a relationship between an inter-feature distance token and the inter-feature distance tokens adjacent to it.
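The dummy-token padding for N = 3 can be reproduced with a short sketch; the function names `pad_with_dummies` and `windows` are hypothetical.

```python
def pad_with_dummies(rule_token, n):
    """Surround an inter-feature distance token with n-1 dummy tokens on each side."""
    pad = ["DUMMY"] * (n - 1)
    return pad + [rule_token] + pad

def windows(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

padded = pad_with_dummies("RULE_X_X", 3)
runs = windows(padded, 3)
```

Every run in `runs` contains the distance token together with dummies only, so two neighboring distance tokens can never fall into the same run of N tokens.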

Embodiment 4.
In Embodiment 3 above, the positional relationship between characteristic terms is converted into inter-feature distance tokens, which are added to the token string and used for learning and classification. Next, an embodiment is described in which learning uses only the inter-feature distance tokens in order to speed up the learning process.

FIG. 12 is a configuration diagram showing an example of the information retrieval system in Embodiment 4.
In FIG. 12, token output control means 108 is added to the preprocessing means 100. The token output control means 108 has two operation modes (mode 1 and mode 2). When mode 1 is set, as in Embodiment 3, the inter-feature distance tokens and, where necessary, dummy tokens are output at the end of the token string composed of feature tokens and non-feature tokens. When mode 2 is set, output of the feature tokens and non-feature tokens is suppressed and only the inter-feature distance tokens are output.

The operation mode is set in an initialization file or the registry, or is given as a command parameter when the system is run. The token output control means 108 has its operation mode set by one of these methods at startup and operates in that mode thereafter.

When mode 1 is set, the token output control means 108 passes through unchanged the output of the non-feature token extraction means 103 and the output of the inter-feature distance token generation means 107.

When mode 2 is set, the token output control means 108 does not output the token string from the non-feature token extraction means 103 (the string composed of feature tokens and non-feature tokens) and outputs only the token string generated by the inter-feature distance token generation means 107 (the string composed of inter-feature distance tokens). This token string may contain dummy tokens. In mode 2, learning of relationships between adjacent tokens is regarded as unnecessary, so the inter-feature distance token generation means 107, the learning means 200, and the classification means 300 can operate with N = 1, which eliminates wasted learning and makes operation more efficient.
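The two operation modes amount to a simple selection over the two token streams, sketched below. The function name `control_output` and the list-based interface are assumptions made for illustration only.

```python
def control_output(mode, token_string, distance_tokens):
    """Select what the preprocessing stage emits for one document."""
    if mode == 1:
        # Mode 1: feature/non-feature token string followed by the distance tokens.
        return list(token_string) + list(distance_tokens)
    if mode == 2:
        # Mode 2: suppress the token string and emit the distance tokens only.
        return list(distance_tokens)
    raise ValueError(f"unknown mode: {mode}")
```

In mode 2 the learner sees far fewer tokens per document, which is where the speed gain in this embodiment would come from.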

Making the operation mode of the learning process selectable in this way provides a flexible system in which the trade-off between classification accuracy and classification speed can be chosen according to the type of application.

Embodiment 5.
Embodiment 3 treats all tokens equally. Next, an embodiment is described in which weights can be set for tokens.

FIG. 13 is a diagram showing an example of a method of setting token weights in Embodiment 5.
In FIG. 13, the type indicates whether a token is a feature token (TOKEN) or an inter-feature distance token (RULE). The classification means 300 holds this weight setting information. It combines the classification probability of the input document for each category calculated from the feature tokens and non-feature tokens with the classification probability calculated from the inter-feature distance tokens, in proportion to these weights, to obtain the final classification probability, and then determines the category.

The classification means 300 of Embodiment 5 calculates a separate classification probability for each token type: Pct (the probability that the input document is classified into category c based on the feature tokens) and Pcr (the probability that the input document is classified into category c based on the inter-feature distance tokens). The probability Pc that the input document is classified into category c is then calculated by the following formula, and the category is determined from it.
Pc = Pct · Wt + Pcr · Wr (where Wt + Wr = 1)
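The weighted combination is a convex mix of the two per-category probabilities. The sketch below assumes both probabilities are given as dictionaries keyed by category and that only Wt is supplied, with Wr derived from the constraint Wt + Wr = 1; the function name `combine` is hypothetical.

```python
def combine(p_token, p_rule, w_token):
    """Return Pc = Pct * Wt + Pcr * Wr per category, with Wr = 1 - Wt."""
    if not 0.0 <= w_token <= 1.0:
        raise ValueError("w_token must lie in [0, 1]")
    w_rule = 1.0 - w_token
    return {c: p_token[c] * w_token + p_rule[c] * w_rule for c in p_token}

# Example with two categories and equal weights.
pc = combine({"personal": 0.6, "other": 0.4}, {"personal": 0.9, "other": 0.1}, 0.5)
```

With equal weights the "personal" category scores 0.6 · 0.5 + 0.9 · 0.5 = 0.75; raising Wt shifts the decision toward the evidence from feature tokens.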

This makes it possible to tune the classification accuracy finely according to the type of application.

FIG. 1 is a configuration diagram showing an example of the information retrieval system in Embodiment 1.
FIG. 2 is a flowchart showing the operation of the information retrieval system in Embodiment 1.
FIG. 3 is a flowchart showing the operation of the classification stage in Embodiment 1.
FIG. 4 is a flowchart showing the operation of the preprocessing means 100 in Embodiment 1.
FIG. 5 is a diagram showing an example of matching conditions in Embodiment 1.
FIG. 6 is a diagram showing an example of a token string generated by the preprocessing means 100 using the matching conditions of FIG. 5 in Embodiment 1.
FIG. 7 is a diagram showing an example of the configuration related to the feature token extraction means 102 when a DFA is used in Embodiment 2.
FIG. 8 is a diagram showing an example of matching conditions split by term length in Embodiment 2.
FIG. 9 is a diagram showing an example of a configuration that constructs matching conditions automatically in Embodiment 2.
FIG. 10 is a configuration diagram showing an example of the information retrieval system in Embodiment 3.
FIG. 11 is a diagram showing an example of the rules 600 in Embodiment 3.
FIG. 12 is a configuration diagram showing an example of the information retrieval system in Embodiment 4.
FIG. 13 is a diagram showing an example of a method of setting token weights in Embodiment 5.

Explanation of symbols

100 preprocessing means, 101 text extraction means, 102 feature token extraction means, 1020 input character string, 1021 matching means, 1022 replacement means, 1023 output character string, 103 non-feature token extraction means, 104 state transition table generation means, 105 state transition table, 106 matching condition generation means, 107 inter-feature distance token generation means, 108 token output control means, 200 learning means, 201 learning frequency calculation means, 202 learning frequency storage means, 300 classification means, 301 classification frequency calculation means, 302 classification probability calculation means, 303 category determination means, 400 matching condition storage means, 4001 built-in matching conditions, 4002 user-defined matching conditions, 501 learning documents, 502 classification target document, 600 rules.

Claims

A matching condition storage means for storing a matching condition between a character string and a keyword and a feature token for identifying the matching condition;
Based on the matching condition and the feature token stored in the matching condition storage unit, the character string of the learning document classified in advance by category is matched with the keyword to correspond to the matched matching condition. The first feature token is extracted in association with the category, and the character string of the classification target document classified by the category based on the matching condition and the feature token stored in the matching condition storage unit And a feature token extracting means for extracting a second feature token corresponding to the matched matching condition;
A first non-feature token obtained by dividing a character string of the learning document from which the first feature token has not been extracted in character units is extracted in association with the category, and the second feature token is extracted. Non-characteristic token extracting means for extracting a second non-characteristic token obtained by dividing the character string of the classification target document that has not been divided into character units;
Learning means for calculating an appearance frequency of a first token string composed of the first characteristic token and the first non-characteristic token in association with the category as a learning frequency;
The classification probability indicating the similarity between the appearance frequency of the second token sequence composed of the second characteristic token and the second non-characteristic token and the learning frequency calculated by the learning means is the category. Classifying means for separately calculating and classifying the classification target document into the category in which the classification probability exceeds a predetermined threshold;
Information retrieval system with

The information retrieval system according to claim 1, wherein
the learning means calculates, as the learning frequency, the appearance frequency of a first token chain composed of n consecutive tokens (n is a natural number) in the first token sequence, only when the first token chain includes the first feature token, and
the classification means calculates, for each category, the classification probability indicating the similarity between the learning frequency calculated by the learning means and the appearance frequency of a second token chain composed of n consecutive tokens (n is a natural number) in the second token sequence, only when the second token chain includes the second feature token.
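The restriction to token chains that contain a feature token can be sketched as follows. The function name `feature_chains` and the `is_feature` predicate are hypothetical; how a real implementation distinguishes feature tokens from character tokens is not specified here.

```python
from collections import Counter

def feature_chains(tokens, n, is_feature):
    """Count chains of n consecutive tokens, keeping only chains with a feature token."""
    chains = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(c for c in chains if any(is_feature(t) for t in c))

# Usage: treat tokens that start with "<" as feature tokens (an assumption).
counts = feature_chains(["a", "<F>", "b", "c"], 2, lambda t: t.startswith("<"))
```

In this usage the chain `("b", "c")` is discarded because it contains only character tokens, so only chains anchored on a feature token enter the frequency table.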

The information retrieval system according to claim 1 or 2, wherein
the matching condition storage means stores a priority-assigned matching condition in which a priority is set for the matching condition, and
the feature token extraction means extracts the first or second feature token based on the priority-assigned matching condition and the feature token.
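One way to read the priority setting is that, when several matching conditions match at the same position, the one with the highest priority decides the feature token. The sketch below assumes that a lower number means a higher priority and uses hypothetical conditions and token names.

```python
import re

# Hypothetical priority-assigned conditions: (priority, regular expression, feature token).
# A lower number is assumed to mean a higher priority.
CONDITIONS = [
    (2, re.compile(r"\d+"), "<NUM>"),
    (1, re.compile(r"\d{4}-\d{2}-\d{2}"), "<DATE>"),
]

def match_at(text, pos):
    """Return (feature token, end offset) for the highest-priority match at pos, or None."""
    for _, pattern, token in sorted(CONDITIONS, key=lambda c: c[0]):
        m = pattern.match(text, pos)
        if m:
            return token, m.end()
    return None
```

Without the priority, the string "2007-10-17" could be consumed as the number "2007" by the more general condition.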

The information retrieval system according to any one of claims 1 to 3, wherein
the matching condition storage means stores a built-in matching condition, which is defined in advance, and a user-defined matching condition, which is defined by a user, and
the feature token extraction means extracts the first or second feature token based on the built-in matching condition, the user-defined matching condition, and the feature token.

The information retrieval system according to any one of claims 1 to 4, wherein
the matching condition storage means stores a regular expression matching condition in which the matching condition is defined by a regular expression, and
the feature token extraction means extracts the first or second feature token based on the regular expression matching condition and the feature token.

The information retrieval system according to claim 1, wherein
the matching condition storage means stores, as the matching condition, a term class matching condition in which a plurality of the keywords are defined for each term class indicating a classification of the keywords, and
the feature token extraction means extracts the first or second feature token based on the term class matching condition and the feature token.

The information retrieval system according to claim 5, wherein
the matching condition storage means stores an ID-assigned matching condition in which a matching condition ID, which is an identification number of the regular expression matching condition, is assigned to the regular expression matching condition, and
the feature token extraction means comprises:
matching means that matches the character string in the document against the keyword based on the ID-assigned matching condition, and outputs the matching condition ID of the matched ID-assigned matching condition and a hit position indicating the end position of the matched character string; and
replacement means that, based on the matching condition ID and the hit position output by the matching means, replaces the character string that matched the matching condition with the first or second feature token corresponding to the ID-assigned matching condition.

The information retrieval system according to claim 7, wherein the matching means performs character string matching using a deterministic finite automaton.
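As an illustration of automaton-based matching that reports the end position of each match, the sketch below builds a deterministic finite automaton for a single non-empty keyword using the classic construction from the Knuth-Morris-Pratt family. The function names and the single-keyword restriction are assumptions; a practical matcher would compile all matching conditions into one automaton.

```python
def build_dfa(keyword):
    """Build a DFA in which state j means the last j characters read equal keyword[:j]."""
    alphabet = set(keyword)
    m = len(keyword)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    dfa[0][keyword[0]] = 1
    restart = 0  # state to fall back to after a mismatch
    for j in range(1, m + 1):
        dfa[j] = dict(dfa[restart])         # mismatch transitions copy the restart state
        if j < m:
            dfa[j][keyword[j]] = j + 1      # match transition advances by one state
            restart = dfa[restart][keyword[j]]
    return dfa

def find_hits(text, keyword, condition_id):
    """Return (matching condition ID, hit position) pairs; the hit position is the
    exclusive end offset of each match (an assumed convention)."""
    dfa, m, state, hits = build_dfa(keyword), len(keyword), 0, []
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)       # characters outside the keyword reset to state 0
        if state == m:
            hits.append((condition_id, i + 1))
    return hits
```

Each input character causes exactly one table lookup, so the scan is linear in the document length, and the matcher naturally knows only where a match ends, which is consistent with the hit position being an end position.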

The information retrieval system according to claim 7, wherein the matching means performs character string matching using a nondeterministic finite automaton.

The information retrieval system according to claim 7, wherein
the matching condition storage means stores a field-information-added matching condition in which the matching condition ID of the ID-assigned matching condition is given a term class field that holds the identification number of the term class and a character count field that holds the number of characters of the keyword, and
the replacement means removes, from immediately before the hit position, a character string of the number of characters held in the character count field of the matching condition ID of the field-information-added matching condition, and inserts the first or second feature token corresponding to the matching condition ID.
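The field layout and the replacement step can be sketched as follows. Packing both fields into one integer, the 8-bit width of the character count field, and the `FEATURE_TOKENS` table are assumptions for illustration only, and the hits are assumed not to overlap.

```python
COUNT_BITS = 8  # assumed width of the character count field

def make_condition_id(term_class, char_count):
    """Pack the term class field and the character count field into one ID."""
    return (term_class << COUNT_BITS) | char_count

def split_condition_id(condition_id):
    """Return (term class, character count) held in the ID."""
    return condition_id >> COUNT_BITS, condition_id & ((1 << COUNT_BITS) - 1)

# Hypothetical mapping from term class to feature token.
FEATURE_TOKENS = {1: "<CITY>"}

def replace_hits(text, hits):
    """Replace each matched string with its feature token.

    hits: (matching condition ID, hit position) pairs, where the hit position is
    the exclusive end offset. Hits are applied right to left so that earlier
    offsets stay valid.
    """
    for condition_id, end in sorted(hits, key=lambda h: h[1], reverse=True):
        term_class, count = split_condition_id(condition_id)
        text = text[:end - count] + FEATURE_TOKENS[term_class] + text[end:]
    return text
```

Carrying the character count inside the ID lets the replacement step recover the start of the match from the end position alone, without a second lookup into the matching conditions.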

The information retrieval system according to claim 10, wherein
the field-information-added matching condition includes a plurality of fixed-length keywords, and
the system further comprises matching condition synthesis means that divides the field-information-added matching condition for each fixed-length keyword and combines those divided conditions whose fixed-length keywords have the same number of characters into a new field-information-added matching condition.
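The synthesis step amounts to regrouping keywords by length. The sketch below assumes that grouping is done within each term class and represents a condition as a `(term_class, keywords)` pair; both are assumptions, since the claim does not specify the data layout.

```python
from collections import defaultdict

def synthesize(conditions):
    """Split each condition per keyword and regroup keywords of equal length.

    conditions: list of (term_class, [fixed-length keywords]).
    Returns {(term_class, char_count): sorted keywords}, one entry per new condition.
    """
    grouped = defaultdict(list)
    for term_class, keywords in conditions:
        for keyword in keywords:
            grouped[(term_class, len(keyword))].append(keyword)
    return {key: sorted(kws) for key, kws in grouped.items()}
```

After regrouping, every keyword in a synthesized condition has the same length, so a single character count field value is valid for the whole condition.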

The information retrieval system according to claim 10, wherein
the matching condition storage means stores, when the keyword has a variable length, a variable-length-information-set matching condition in which variable length information indicating variable length is set in the character count field corresponding to the variable-length keyword, and
when the variable length information is set in the character count field of the variable-length-information-set matching condition, the replacement means does not remove the character string matched with the variable-length keyword from before the hit position, and inserts the first or second feature token immediately after the hit position.
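The two replacement behaviours can be contrasted in a few lines. Using 0 as the variable length information is an assumption, as are the function and token names.

```python
VARIABLE = 0  # assumed sentinel stored in the character count field

def apply_hit(text, end, char_count, token):
    """Apply one hit whose exclusive end offset is `end`."""
    if char_count == VARIABLE:
        # Variable-length keyword: keep the matched text, insert the token after it.
        return text[:end] + token + text[end:]
    # Fixed-length keyword: remove char_count characters before the hit position.
    return text[:end - char_count] + token + text[end:]
```

For a variable-length match the start offset cannot be derived from the end position, so leaving the matched characters in place and appending the token avoids removing the wrong span.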

The information retrieval system according to any of the above claims, further comprising:
a rule that stores a matching condition sequence, which defines an order relationship among a plurality of the matching condition IDs, and a rule ID, which is an identifier of the matching condition sequence; and
inter-feature distance token generation means that analyzes the matching condition IDs and the hit positions output by the matching means, detects a chain of the matching condition IDs appearing in an order that matches the matching condition sequence stored in the rule, and generates an inter-feature distance token, which is an identifier combining the distance between the hit positions in the chain of matching condition IDs with the rule ID of the rule whose matching condition sequence was matched, wherein
the learning means learns the learning frequency, which is the appearance frequency for each category, based on the appearance frequency of the inter-feature distance token, and
the classification means calculates the classification probability for each category based on the appearance frequency of the inter-feature distance token, and classifies the classification target document into each category for which the classification probability exceeds a predetermined threshold.
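Generation of inter-feature distance tokens can be sketched as follows. The rule table, the token format `"R1:7"`, and the restriction to hits that are directly adjacent in the hit list are assumptions made to keep the example short.

```python
# Hypothetical rules: rule ID -> ordered sequence of matching condition IDs.
RULES = {"R1": (1, 2)}

def distance_tokens(hits):
    """Generate inter-feature distance tokens from hits.

    hits: (matching condition ID, hit position) pairs sorted by hit position.
    A token combines the rule ID with the distances between consecutive hit positions.
    """
    tokens = []
    for rule_id, sequence in RULES.items():
        n = len(sequence)
        for i in range(len(hits) - n + 1):
            window = hits[i:i + n]
            if tuple(cid for cid, _ in window) == sequence:
                gaps = [b[1] - a[1] for a, b in zip(window, window[1:])]
                tokens.append(f"{rule_id}:" + ",".join(map(str, gaps)))
    return tokens
```

The resulting tokens can be counted per category in the same way as feature tokens, so the learning and classification steps need no structural change to use them.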

The information retrieval system according to claim 13, further comprising token output control means in which an operation mode condition is set for selectively outputting the inter-feature distance token generated by the inter-feature distance token generation means, the feature token extracted by the feature token extraction means, and the non-feature token extracted by the non-feature token extraction means, and which, when the operation mode condition is set to suppress output of the feature token and the non-feature token, performs control so that only the inter-feature distance token is output.

The information retrieval system according to claim 13, wherein the classification means classifies the classification target document into the category by using a third classification probability obtained by weighting and combining a first classification probability for the category, calculated based on the feature token and the non-feature token, and a second classification probability for the category, calculated based on the inter-feature distance token.
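The weighted combination can be sketched as a linear blend of the two per-category probabilities. The linear form, the `weight` parameter, and the function names are assumptions; the claim only states that the two probabilities are weighted and combined.

```python
def combine(p_token, p_distance, weight=0.5):
    """Blend two per-category probability tables into a third one.

    weight is applied to the token-based probability and (1 - weight) to the
    distance-based probability; a missing category counts as 0.0.
    """
    categories = set(p_token) | set(p_distance)
    return {cat: weight * p_token.get(cat, 0.0) + (1 - weight) * p_distance.get(cat, 0.0)
            for cat in categories}

def classify(p_token, p_distance, threshold=0.5, weight=0.5):
    """Return, in sorted order, the categories whose combined probability exceeds the threshold."""
    combined = combine(p_token, p_distance, weight)
    return sorted(cat for cat, p in combined.items() if p > threshold)
```

Setting `weight` to 1.0 or 0.0 reduces the sketch to classification by tokens alone or by inter-feature distance tokens alone.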