JP2007304950A

JP2007304950A - Document processing device and document processing method

Info

Publication number: JP2007304950A
Application number: JP2006133828A
Authority: JP
Inventors: Seiji Kashimoto; 誠司樫本
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2006-05-12
Filing date: 2006-05-12
Publication date: 2007-11-22

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of information extraction from a document file.

SOLUTION: Each word of a learning corpus 200 is classified into one of a plurality of classes. The document processing device 100 holds the features of the words in the learning corpus 200 as class feature information for each class in a class feature holding unit 170. The document processing device 100 extracts a word from a pre-processing inspection target document 210, calculates, for each of the plurality of classes, the fitness between the feature of the word in the pre-processing inspection target document 210 and the class feature information, adjusts the fitness calculated for a predetermined class, and specifies the class corresponding to the extracted word based on the fitness for each class. The name of the specified class is added as a tag, thereby creating a processed inspection target document 212.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a technique for extracting information from a document file, and more particularly to a named entity extraction technique.

With the spread of computers and the development of network technology, the exchange of electronic information via networks has become widespread. As a result, much of the office work that was conventionally done on paper is being replaced by network-based processing. Advances in digitization and network technology have sharply reduced the cost of acquiring information. Against this background, attention has focused on information extraction technology, which extracts the central information from individual documents, and in particular on named entity extraction technology, which extracts named entities such as personal names, company names, dates, and job titles from documents.
JP 2006-048536 A

Among the various terms in a document file, terms that are so-called named entities are often important information that characterizes the content of the document. However, because new named entities are constantly being created, it is not easy to determine which words are named entities and which are not.

An object of the present invention is to provide a technique for improving the accuracy of information extraction from documents, and in particular the accuracy of named entity extraction.

One aspect of the present invention is a document processing apparatus.
This apparatus classifies words into a plurality of classes and holds the features of the words in a predetermined corpus as class feature information for each class. The apparatus extracts a word from an inspection target document, calculates, for each of the plurality of classes, the fitness between the feature of that word in the inspection target document and the class feature information, adjusts the fitness calculated for a predetermined class, and then specifies the class corresponding to the extracted word based on the fitness for each class.

A feature here is information indicating how a word appears in a document. For example, a word followed by “san”, or preceded by a term such as “Mr.” or “Dear”, is likely to be a personal name. By comparing the feature of a word extracted from the inspection target document with the feature tendencies of the words belonging to a given class (category), the class to which the extracted word corresponds is specified. Information extraction accuracy is improved by adjusting the fitness for a predetermined class, for example the class to which the most words in the corpus belong.

Any combination of the above constituent elements, and any conversion of the expression of the present invention among a method, a system, a recording medium, a computer program, and the like, are also effective as aspects of the present invention.

According to the present invention, the accuracy of information extraction from documents can be improved.

FIG. 1 is a schematic diagram for explaining an outline of processing by the document processing apparatus 100.
The processing of the document processing apparatus 100 can be divided into two stages, “learning processing” and “inspection processing”. Each stage is outlined below. In this embodiment, a “document file” is assumed to be a structured document file structured by a predetermined tag set such as XML (eXtensible Markup Language).

1. Learning processing
Learning processing covers everything from generating class feature information from the learning corpus 200 to registering it in the class feature holding unit 170. The learning corpus 200 in this embodiment is a collection of a large number of document files.
A “class” is set for the enormous number of words in the learning corpus 200. A class is a concept indicating the category of a word. For example, the class “place name” is set for the word “Shibuya” in a certain document file in the learning corpus 200. Such settings are usually made by hand. More specifically, a <place name> tag in a namespace named “class” is attached to the word “Shibuya” in a given document file in the learning corpus 200. A class may be set as an attribute instead of a tag. Besides place names, several kinds of classes are provided for classifying words that are named entities, such as personal names, dates, organization names, job titles, and numerical expressions. The document processing apparatus 100 extracts words from such a learning corpus 200.

In the learning corpus 200, some words are explicitly assigned a class, while others are not, because they are not named entities. The latter usually far outnumber the former. Words that are not named entities are classified into the “N class”. In contrast, a class explicitly assigned to a word that is a named entity is called a “unique class”. To simplify the explanation, this embodiment assumes only three unique classes, “person name”, “place name”, and “date and time”, and all other words are classified into the N class. That is, every word in the learning corpus 200 is classified into one of four classes: the three unique classes and the N class.

Next, the document processing apparatus 100 extracts a “feature” for each word collected from the learning corpus 200. A feature indicates how a word appears in a document file. For example, suppose the word “Shibuya” above is used in the learning corpus 200 in the context “Let's meet at Shibuya Station ...”, and another word of the “place name” class, “Daikanyama”, is used in the context “There was an accident at Daikanyama Station ...”. What the two words share is that the word “station” follows them. In other words, when “station” appears in a document file, the word immediately before it is likely to belong to the “place name” class. As another example, a word followed by “san”, as in “Shibuya-san ...”, is likely to belong to the “person name” class. In this way, the features of each class are derived from the words, classes, and features in the learning corpus 200 and registered in the class feature holding unit 170 as class feature information. For example, in the “place name” class 30% of the words are followed by “station”, whereas in the “person name” class 1% are; the features of the words belonging to each class are registered as class feature information in this manner.
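The learning step described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the corpus has already been tokenized into (word, class) pairs, uses only the immediately following word as the feature, and the function name `learn_class_features` and the short class labels are hypothetical.

```python
from collections import Counter, defaultdict


def learn_class_features(sentences):
    """Build class feature information from a tagged corpus.

    sentences: list of sentences, each a list of (word, class_name) pairs.
    Returns {class_name: {following_word: ratio}}, where ratio is the share
    of words in that class that are immediately followed by following_word.
    """
    following = defaultdict(Counter)  # class -> counts of the next word
    totals = Counter()                # class -> number of words in the class
    for sentence in sentences:
        for i, (_word, cls) in enumerate(sentence):
            totals[cls] += 1
            if i + 1 < len(sentence):
                following[cls][sentence[i + 1][0]] += 1
    return {
        cls: {nxt: cnt / totals[cls] for nxt, cnt in following[cls].items()}
        for cls in totals
    }


# Toy corpus (assumed data); "N" marks words that are not named entities.
corpus = [
    [("Shibuya", "place"), ("station", "N"), ("at", "N"), ("meet", "N")],
    [("Daikanyama", "place"), ("station", "N"), ("accident", "N")],
    [("Shibuya", "person"), ("san", "N")],
]
class_features = learn_class_features(corpus)
print(class_features["place"])
```

In this toy corpus both “place” words are followed by “station”, so the ratio for that pair is 1.0; a realistic corpus would give values such as the 30% mentioned above.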

2. Inspection processing
Inspection processing specifies a class for each word in an inspection target document in which no “class” is explicitly specified. An “inspection target document” is a document file in which the classes of words are to be newly specified by the inspection processing of the document processing apparatus 100.

First, the document processing apparatus 100 acquires a pre-processing inspection target document 210, which is an inspection target document in which no “class” is specified. The document processing apparatus 100 extracts words one by one from the pre-processing inspection target document 210 and also detects the feature of each word in it.

Next, the document processing apparatus 100 compares the features of the words in the pre-processing inspection target document 210 with the class feature information in the class feature holding unit 170. For example, for a word A in the pre-processing inspection target document 210, it compares the feature of word A with the class feature information of the “person name” class and calculates their similarity as the “fitness”. For word A, the fitness for the “place name” class, for the “date and time” class, and for the “N” class are calculated as well, giving four fitness values in total. The specific method of calculating fitness is described later.

The document processing apparatus 100 assigns word A to the class with the largest of the four fitness values. Suppose here that word A is classified into the “place name” class. In that case, the document processing apparatus 100 attaches a <place name> tag to word A in the pre-processing inspection target document 210. If word A belongs to the “N” class, no tag is attached. In this way, the document processing apparatus 100 specifies a class for every word in the pre-processing inspection target document 210 from its feature and the class feature information, and outputs a processed inspection target document 212. The processed inspection target document 212 is the document file obtained by attaching tags indicating unique classes (hereinafter “unique tags”) to the pre-processing inspection target document 210. In FIG. 1, document files drawn with hatching are document files to which unique tags have been attached.
As an example of using the processed inspection target document 212, its unique tags may be used as search keys to retrieve desired information from the pre-processing inspection target document 210. For example, by using as a search pattern a sentence structure that contains the “place name”, “date and time”, and “person name” classes, as in the sentence “Meet Kato-san in Shibuya at 16:00”, information of the pattern “when, where, with whom” can be found efficiently among the information in the pre-processing inspection target document 210.

The processing described above is shared with typical named entity extraction processes that use a Bayesian algorithm. With such processing, however, words in the pre-processing inspection target document 210 that are not actually named entities (hereinafter “non-unique words”) are often assigned a unique class by the document processing apparatus 100 as if they were named entities (hereinafter “unique words”). This embodiment solves this problem of named entity extraction algorithms by adding a new process to the above. The specific method is described in detail with reference to FIG. 5 onward; first, the configuration of the document processing apparatus 100 is described.

FIG. 2 is a functional block diagram of the document processing apparatus 100.
In hardware terms, each block shown here can be realized by elements such as a computer's CPU and by mechanical devices, and in software terms by a computer program or the like; here, the blocks depict functions realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by combinations of hardware and software.

The document processing apparatus 100 includes a user interface processing unit 110, a data processing unit 120, and a data holding unit 130.
The user interface processing unit 110 handles user interface processing in general, such as processing input from the user and displaying information to the user. In this embodiment, the user interface service of the document processing apparatus 100 is described as being provided by the user interface processing unit 110. As another example, the user may operate the document processing apparatus 100 via the Internet. In that case, a communication unit (not shown) receives operation instruction information from the user terminal and transmits to the user terminal the result information of the processing executed based on the operation instruction.

The data processing unit 120 executes various kinds of data processing based on the data acquired from the user interface processing unit 110. The data processing unit 120 also serves as an interface between the user interface processing unit 110 and the data holding unit 130. The data holding unit 130 stores various data, such as setting data prepared in advance and data received from the data processing unit 120. The data holding unit 130 includes a class feature holding unit 170 for holding class feature information.

The user interface processing unit 110 includes an input unit 112 and a display unit 114. The input unit 112 receives input operations from the user. The display unit 114 displays various information to the user. The input unit 112 includes a document acquisition unit 116 for acquiring document files such as inspection target documents.

The data processing unit 120 includes an inspection unit 122 and an adjustment coefficient specifying unit 134. The inspection unit 122 inspects a document file and specifies the classes of the words in it. The inspection unit 122 includes a word extraction unit 124, a fitness level calculation unit 126, a fitness level adjustment unit 128, and a class classification unit 132. The word extraction unit 124 extracts words from the document file and detects their features. The fitness level calculation unit 126 calculates the similarity between the class feature information in the class feature holding unit 170 and the feature of the extracted word as an index called fitness. Any method may be used to calculate fitness; as one example, it may be calculated as follows.

Suppose that the word immediately following word A in the inspection target document is “station” (hereinafter, a word such as word A that is being inspected for classification is called an “inspection target word”). Suppose further that, according to the class feature information obtained from the learning corpus 200, 200 of the 1000 words belonging to the place name class (20%) are immediately followed by “station”, as are 20 of the 2000 words in the person name class (1%), 10 of the 500 words in the date and time class (2%), and 1500 of the 10000 words in the N class (15%). The fitness of word A for the place name, person name, date and time, and N classes is then 20, 1, 2, and 15 points, respectively. In this case, the feature suggests that inspection target word A is likely to belong to the “place name” class. The feature need not be limited to the word immediately following the inspection target word; various words appearing around the inspection target word in the document file may be used. Known parameters from the field of machine learning may be used to define features. In this embodiment, however, to simplify the explanation, the word immediately following the inspection target word is treated as its feature.
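The worked example above can be reproduced with a few lines of code. This is only a sketch of the example calculation, in which fitness is taken to be the percentage of words in each class that are followed by “station”; the names `CLASS_COUNTS` and `fitness_scores` are hypothetical.

```python
# (words followed by "station", total words in the class), from the example.
CLASS_COUNTS = {
    "place name": (200, 1000),
    "person name": (20, 2000),
    "date and time": (10, 500),
    "N": (1500, 10000),
}


def fitness_scores(counts):
    """Return the fitness in points (percent) for each class."""
    return {cls: 100.0 * hits / total for cls, (hits, total) in counts.items()}


scores = fitness_scores(CLASS_COUNTS)
best_class = max(scores, key=scores.get)  # class with the highest fitness
print(scores, best_class)
```

With these counts the place name class scores highest, matching the conclusion in the text, while the N class comes second.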

The class classification unit 132 specifies the class of each word in the document file based on the fitness. In the example above, word A is assigned to the place name class, which has the highest fitness, 20 points. The fitness level adjustment unit 128 adjusts the fitness calculated by the fitness level calculation unit 126. This adjustment addresses the problem described above, namely that words that are actually non-unique words are often specified as unique words by the document processing apparatus 100. Using the adjustment coefficient T specified by the adjustment coefficient specifying unit 134, the fitness level adjustment unit 128 adjusts the fitness of the inspection target word for the N class by multiplying it by (2 − T). Details are described later with reference to FIG. 5 onward.

The adjustment coefficient specifying unit 134 specifies the adjustment coefficient T. The adjustment coefficient specifying unit 134 includes a provisional optimum coefficient specifying unit 136. The provisional optimum coefficient specifying unit 136 specifies the provisional optimum coefficient S needed to obtain the adjustment coefficient T. The provisional optimum coefficient is described in detail with reference to FIG. 7.

FIG. 3 is a flowchart showing the inspection process when the adjustment processing is not performed. It shows the processing for one particular word, taken as the inspection target word, in a particular inspection target document. The flowchart is therefore executed in turn for each word in the inspection target document.
First, the word extraction unit 124 extracts inspection target words one by one from the inspection target document acquired by the document acquisition unit 116 (S10). The word extraction unit 124 detects the feature of the inspection target word (S12). Here, the word appearing immediately after the inspection target word is detected as the feature.

The fitness level calculation unit 126 first selects the person name class from the four classes (S14). It calculates the fitness of the inspection target word for the person name class based on the class feature information for that class in the class feature holding unit 170 and the feature of the inspection target word (S16). S14 and S16 are also performed for each of the other classes. If fitness has not yet been calculated for all classes (N in S18), processing returns to S14 and the next class is selected. Once it has (Y in S18), the class classification unit 132 specifies the class of the inspection target word based on the four fitness values (S20). If the specified class is not the N class (N in S22), that is, if it is one of the unique classes, the class classification unit 132 inserts the corresponding unique tag for the inspection target word (S24). If it is the N class (Y in S22), processing ends as is. Performing this processing for every word in the inspection target document specifies the class of each word in it.
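The flow of S10 to S24 can be sketched as a simple loop. The sketch below rests on several assumptions that are not taken from the embodiment: the document is already split into words, fitness is looked up directly from a table of class feature information, ties are broken in favor of the N class, and the tag names `person`, `place`, and `date` as well as the function names are hypothetical.

```python
def classify_word(feature, class_features):
    """S14-S20: compute the fitness for every class and pick the best one."""
    scores = {cls: feats.get(feature, 0.0) for cls, feats in class_features.items()}
    # Assumption: on a tie (for example, all scores are 0) prefer the N class.
    return max(scores, key=lambda cls: (scores[cls], cls == "N"))


def tag_document(words, class_features):
    """S10-S24 for every word: detect its feature, classify, insert a tag."""
    tagged = []
    for i, word in enumerate(words):
        # S12: the feature is the word immediately after the target word.
        feature = words[i + 1] if i + 1 < len(words) else None
        cls = classify_word(feature, class_features)
        # S22/S24: only unique classes receive a tag.
        tagged.append(word if cls == "N" else f"<{cls}>{word}</{cls}>")
    return " ".join(tagged)


# Toy class feature information (assumed values).
class_features = {
    "person": {"san": 0.40, "station": 0.01},
    "place": {"station": 0.20},
    "date": {"station": 0.02},
    "N": {"station": 0.15, "san": 0.01},
}
result = tag_document(["Shibuya", "station", "at", "Kato", "san"], class_features)
print(result)
```

Here “Shibuya” is tagged as a place because it is followed by “station”, and “Kato” as a person because it is followed by “san”; the remaining words fall into the N class and stay untagged.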

To quantify the accuracy of such inspection processing, three indices are used: precision, recall, and the F value (F-measure). In this embodiment, these indices measure the degree of agreement between the “actual unique words” and the “words specified as unique words”. Suppose that the inspection target document contains N_1 words, of which N_2 are unique words, and that the document processing apparatus 100 detects N_3 of the N_1 words as unique words. Let N_4 be the number of actual unique words correctly detected by the document processing apparatus 100, that is, the size of the intersection of the N_2 actual unique words and the N_3 detected words. The indices are then obtained as
Precision V = N_4 / N_3
Recall W = N_4 / N_2
F value = 2 · V · W / (V + W)
The F value calculated in this way is the harmonic mean of the precision V and the recall W.
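As an illustration of these three indices, the following sketch computes V, W, and the F value from two sets of word positions. The function name `evaluate` and the sample positions are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the text specifies.

```python
def evaluate(actual_unique, detected_unique):
    """Return (precision V, recall W, F value) for two sets of word positions."""
    actual, detected = set(actual_unique), set(detected_unique)
    n2, n3 = len(actual), len(detected)
    n4 = len(actual & detected)  # actual unique words that were detected
    v = n4 / n3 if n3 else 0.0
    w = n4 / n2 if n2 else 0.0
    f = 2 * v * w / (v + w) if v + w else 0.0
    return v, w, f


# Assumed example: 4 actual unique words, 5 detected, 3 in common.
v, w, f = evaluate({0, 3, 5, 8}, {0, 3, 4, 6, 8})
print(v, w, f)
```

In this example N_2 = 4, N_3 = 5, and N_4 = 3, so V = 3/5, W = 3/4, and the F value is their harmonic mean, 2/3.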

The precision V is the proportion of words specified as unique words by the class classification unit 132 that are actually unique words; it indicates how well false detections are avoided. The recall W is the proportion of actual unique words that the class classification unit 132 detects as unique words; it indicates how completely the unique words are detected. Ideally both V and W are close to 1.0, but the two are usually in a trade-off relationship. The F value is an index for evaluating the system as a whole based on both. The larger the F value, the better the system's named entity detection performance.

FIG. 4 is a graph showing precision and recall when the adjustment processing is not performed.
The horizontal axis indicates the data size of the learning corpus 200, and the vertical axis the precision and recall values. The graph shows the results of an experiment conducted by the inventor on a given learning corpus 200. As the graph shows, the larger the learning corpus 200, in other words the richer the class feature information, the more both recall and precision gradually increase. However, a larger learning corpus 200 means more work assigning classes to the words in it. An increase in the amount of data in the class feature holding unit 170 also, needless to say, puts pressure on the memory of the document processing apparatus 100. According to the graph, even when the learning corpus 200 becomes 10 or 100 times larger, recall and precision do not improve very much. Low precision is a particular problem. Low precision means many false detections, that is, many non-unique words being specified as unique words. The inventor considered that the F value could be improved by reducing the adjustment coefficient T so that the fitness of the inspection target word for the N class increases, making words less likely to be specified as unique words, in other words more likely to be specified as non-unique words.

FIG. 5 is a graph showing the precision and the recall when the adjustment coefficient T is changed.
The horizontal axis represents the adjustment coefficient T, and the vertical axis represents the precision and recall values. Hereinafter, for the "fitness with respect to the N class", the fitness before adjustment is called the pre-adjustment fitness X and the fitness after adjustment is called the post-adjustment fitness Y, where
Post-adjustment fitness Y = pre-adjustment fitness X × (2 − adjustment coefficient T).
Therefore, an adjustment coefficient T = 1.0 corresponds to no adjustment. The smaller the adjustment coefficient T, the larger the post-adjustment fitness Y. That is, the inspection target word becomes easier to classify into the N class.
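
The adjustment above can be illustrated with a minimal sketch. It assumes that each fitness is a positive score and that the class with the largest fitness is chosen; the class names and the numeric scores are hypothetical and serve only as an illustration.

```python
def adjust_n_fitness(x: float, t: float) -> float:
    """Post-adjustment fitness Y = X * (2 - T) for the N class."""
    return x * (2.0 - t)


def classify(fitness: dict[str, float], t: float, n_class: str = "N") -> str:
    """Pick the class with the largest fitness after adjusting only the N class.

    Assumption: the largest fitness wins (not spelled out in this passage).
    """
    adjusted = dict(fitness)
    adjusted[n_class] = adjust_n_fitness(fitness[n_class], t)
    return max(adjusted, key=adjusted.get)


# Hypothetical fitness values for one inspection target word.
scores = {"person": 0.50, "place": 0.20, "organization": 0.10, "N": 0.45}

no_adjustment = classify(scores, t=1.0)  # N stays at 0.45, below 0.50
with_adjustment = classify(scores, t=0.8)  # N becomes 0.45 * 1.2 = 0.54
print(no_adjustment, with_adjustment)
```

With T = 1.0 the word keeps its unique class, while T = 0.8 raises the N-class fitness enough for the word to fall into the N class.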

When the adjustment coefficient T is reduced, the recall decreases but the precision increases. As T becomes smaller, the post-adjustment fitness Y for the N class increases. Because the number N₃ of words classified as unique words is narrowed down, the precision (V = N₄/N₃) increases. On the other hand, as N₃ decreases, the number N₄ of actual unique words correctly detected by the document processing apparatus 100 also decreases, so the recall (W = N₄/N₂) decreases. The reverse holds when T is increased. Therefore, by setting T so that the F value is maximized, the accuracy with which the document processing apparatus 100 classifies inspection target words can be improved.
Next, an algorithm for automatically obtaining such an optimum adjustment coefficient T based on the learning corpus 200 will be described. The document processing apparatus 100 in this embodiment specifies the adjustment coefficient T at the learning stage.
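
The trade-off between precision and recall described above can be made concrete with a small sketch. The counts N₂, N₃ and N₄ follow the text; the harmonic-mean form of the F value is an assumption here (the standard F-measure), and the example counts are invented purely for illustration.

```python
def precision_recall_f(n2: int, n3: int, n4: int) -> tuple[float, float, float]:
    """Return (precision V, recall W, F value).

    n2: number of actual unique words
    n3: number of words classified as unique words
    n4: number of actual unique words that were classified as unique words
    Assumption: F = 2VW / (V + W), the usual harmonic mean.
    """
    v = n4 / n3 if n3 else 0.0
    w = n4 / n2 if n2 else 0.0
    f = 2 * v * w / (v + w) if (v + w) else 0.0
    return v, w, f


# Hypothetical counts: a loose setting (large T) and a tighter one (smaller T).
loose = precision_recall_f(n2=100, n3=200, n4=80)
tight = precision_recall_f(n2=100, n3=90, n4=63)
print(loose)
print(tight)
```

In this invented example the tighter setting trades some recall for precision and ends with the higher F value, which is the behavior the adjustment coefficient is meant to exploit.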

FIG. 6 is a flowchart showing the process of calculating the fitness based on a plurality of adjustment coefficients in the learning process.
To specify the optimum adjustment coefficient T, the learning corpus 200 is divided into K groups (K is an integer of 2 or more). For example, if the learning corpus 200 contains 1000 document files and is divided into 10 groups, the documents may be grouped 100 at a time. Of the K groups, the i-th group (1 ≤ i ≤ K) is called the inspection group, and the remaining groups are called the corpus group. For example, when the first group is the inspection group, the second through K-th groups together form the corpus group. The figure shows the processing executed for one particular inspection target word A in the i-th group when the i-th group of the K groups is the inspection group. The processing shown in the figure is therefore also executed for the other inspection target words in the i-th group, and likewise for the groups other than the i-th group.
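
The grouping described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and splitting the documents into contiguous, nearly equal groups is an assumption, since the text only requires K groups.

```python
from typing import Iterator, List, Sequence, Tuple


def split_into_groups(documents: Sequence[str], k: int) -> List[List[str]]:
    """Divide the corpus into K contiguous groups of nearly equal size."""
    if k < 2:
        raise ValueError("K must be an integer of 2 or more")
    if len(documents) < k:
        raise ValueError("need at least K documents")
    base, extra = divmod(len(documents), k)
    groups, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        groups.append(list(documents[start:end]))
        start = end
    return groups


def inspection_and_corpus(
    groups: List[List[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (inspection group, corpus group) with each group inspected in turn."""
    for i, inspection in enumerate(groups):
        corpus = [doc for j, g in enumerate(groups) if j != i for doc in g]
        yield inspection, corpus


# 1000 documents in 10 groups, as in the example in the text.
documents = [f"doc{n:04d}" for n in range(1000)]
groups = split_into_groups(documents, 10)
pairs = list(inspection_and_corpus(groups))
print(len(groups), len(pairs[0][0]), len(pairs[0][1]))
```

Each pair keeps the inspection group out of the corpus group, so class feature information is never taken from the words being inspected.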

First, the word extraction unit 124 extracts the inspection target word A from the i-th group, which is the inspection group (S30). The word extraction unit 124 detects the features of the inspection target word A (S32). The fitness calculation unit 126 selects a class from the four classes (S34). The fitness calculation unit 126 then calculates the fitness of the inspection target word A for the class selected in S34 (S36), based on the features of the word A and on the class feature information obtained only from the corpus group, that is, from the learning corpus 200 with the i-th group removed. If the fitness has not yet been calculated for all classes (N in S38), the process returns to S34 and the next class is selected. When it has been calculated for all classes (Y in S38), the fitness adjustment unit 128 sets the provisional coefficient R to 1.0 (S40). The provisional coefficient is a tentative coefficient that is varied within a predetermined range, for example 0.5 to 1.0, before the value of the adjustment coefficient is fixed.

If the provisional coefficient R is larger than 0.5 (Y in S42), the fitness adjustment unit 128 adjusts the fitness of the inspection target word A with respect to the N class using the provisional coefficient R (S44). The adjustment method is as described above. The class classification unit 132 specifies the class of the inspection target word A based on the four fitness values (S46). The fitness adjustment unit 128 subtracts a predetermined value, for example 0.01, from the provisional coefficient (S48). If the provisional coefficient R is still larger than 0.5, the processing from S44 to S48 is executed again with the new provisional coefficient R. When the provisional coefficient R becomes 0.5 or less (N in S42), the processing for the inspection target word A ends. In this way, the class of one inspection target word A in the i-th group is specified under various provisional coefficients. For example, the word A may be specified as the person name class when the provisional coefficient is 0.8 and as the N class when the provisional coefficient is 0.6. All groups are selected in turn as the inspection group, and all words in each group are selected in turn as inspection target words. As a result, for every word in the learning corpus 200, the class is specified for each provisional coefficient from 0.5 to 1.0.
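
The loop from S40 to S48 can be sketched as below. This is a hedged illustration, not the apparatus itself: the fitness values and class names are hypothetical, the largest fitness is assumed to decide the class, and the coefficients are generated from integer hundredths to avoid floating-point drift from repeated subtraction. Following the flowchart condition "R larger than 0.5", the sweep here covers 1.00 down to 0.51.

```python
N_CLASS = "N"


def provisional_coefficients() -> list[float]:
    """Provisional coefficients 1.00, 0.99, ..., 0.51 (step 0.01, R > 0.5)."""
    return [h / 100 for h in range(100, 50, -1)]


def classify(fitness: dict[str, float], r: float) -> str:
    """Adjust only the N-class fitness by (2 - R), then take the best class."""
    adjusted = dict(fitness)
    adjusted[N_CLASS] = fitness[N_CLASS] * (2.0 - r)
    return max(adjusted, key=adjusted.get)


def classify_under_each_coefficient(fitness: dict[str, float]) -> dict[float, str]:
    """Map each provisional coefficient R to the class chosen under it."""
    return {r: classify(fitness, r) for r in provisional_coefficients()}


# Hypothetical pre-adjustment fitness values for inspection target word A.
word_a = {"person": 0.50, "place": 0.10, "organization": 0.05, "N": 0.40}
by_r = classify_under_each_coefficient(word_a)
print(by_r[0.8], by_r[0.6])
```

With these invented scores the word is a person name at R = 0.8 and falls into the N class at R = 0.6, mirroring the example in the text.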

FIG. 7 is a flowchart showing the process of obtaining the provisional optimum coefficient S after the processing shown in FIG. 6 is completed.
It shows the processing for specifying, as the provisional optimum coefficient S, the provisional coefficient that is optimal for the i-th of the K groups, in other words the provisional coefficient at which the F value is highest. The processing shown in the figure is also executed for each group other than the i-th group. As a result, K provisional optimum coefficients S are obtained for the K groups.

First, the adjustment coefficient specifying unit 134 sets the provisional coefficient R to 1.0 (S50). If the provisional coefficient R is larger than 0.5 (Y in S52), the provisional optimum coefficient specifying unit 136 calculates the precision for the words included in the i-th group (S54). That is, among the words in the i-th group that were specified as unique words, the proportion that are actually unique words is calculated as the precision. Next, the provisional optimum coefficient specifying unit 136 calculates the recall for the words included in the i-th group (S56), and calculates the F value from the precision and the recall (S58). The provisional optimum coefficient specifying unit 136 subtracts 0.01 from the provisional coefficient R (S60). When the provisional coefficient R becomes 0.5 or less (N in S52), the provisional optimum coefficient specifying unit 136 specifies, as the provisional optimum coefficient S, the provisional coefficient R at which the F value was largest among the F values obtained for the provisional coefficients (S62). In this way, the provisional optimum coefficient S, which is the adjustment coefficient at which the F value is largest for the i-th group of the learning corpus 200, is specified.
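
The selection in S50 to S62 can be sketched as follows, assuming the classes specified under each provisional coefficient are already available. The helper names, the harmonic-mean F value, the tie-breaking rule (the largest R wins) and the sample data are all assumptions made for illustration.

```python
def f_value(gold: list[str], predicted: list[str], n_class: str = "N") -> float:
    """F value for the unique / non-unique distinction in one inspection group."""
    n2 = sum(g != n_class for g in gold)       # actual unique words
    n3 = sum(p != n_class for p in predicted)  # words specified as unique
    n4 = sum(g != n_class and p != n_class for g, p in zip(gold, predicted))
    v = n4 / n3 if n3 else 0.0                 # precision
    w = n4 / n2 if n2 else 0.0                 # recall
    return 2 * v * w / (v + w) if (v + w) else 0.0


def provisional_optimum(
    gold: list[str], predictions_by_r: dict[float, list[str]]
) -> tuple[float, float]:
    """Return (provisional optimum coefficient S, its F value)."""
    best_r, best_f = None, -1.0
    for r in sorted(predictions_by_r, reverse=True):  # from 1.0 downward
        f = f_value(gold, predictions_by_r[r])
        if f > best_f:  # on ties, keep the larger R (an assumption)
            best_r, best_f = r, f
    return best_r, best_f


# Hypothetical inspection group of five words and four provisional coefficients.
gold = ["person", "N", "N", "place", "N"]
predictions_by_r = {
    1.0: ["person", "person", "place", "place", "N"],
    0.9: ["person", "N", "place", "place", "N"],
    0.8: ["person", "N", "N", "place", "N"],
    0.7: ["N", "N", "N", "place", "N"],
}
s, best_f = provisional_optimum(gold, predictions_by_r)
print(s, best_f)
```

In this invented group the F value peaks at R = 0.8, so 0.8 would be the provisional optimum coefficient S for the group.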

In the same way, K provisional optimum coefficients are obtained, one for each of the K groups. The adjustment coefficient calculation unit 138 takes the average of the K provisional optimum coefficients as the adjustment coefficient T. With this processing method, an appropriate adjustment coefficient for the learning corpus 200 can be found by looking for a favorable F value using the learning corpus 200, in which the correct class has already been set for each word. The adjustment coefficient T is not limited to the average value; it may be obtained by any predetermined arithmetic expression that takes the K provisional optimum coefficients as variables. For example, of the K provisional optimum coefficients S, the one with the highest F value may be used as the adjustment coefficient T as it is. Suppose that K = 2, that the provisional optimum coefficient S of the first group is 0.6 with an F value of 0.8, and that the provisional optimum coefficient S of the second group is 0.65 with an F value of 0.75. The adjustment coefficient T is then 0.6, because the F value of 0.8 is larger than the other F value of 0.75.
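
The two ways of combining the K provisional optimum coefficients mentioned above can be written as a short sketch. The function names are hypothetical; the input values are the K = 2 example from the text.

```python
def adjustment_by_average(results: list[tuple[float, float]]) -> float:
    """Average of the provisional optimum coefficients S."""
    return sum(s for s, _ in results) / len(results)


def adjustment_by_best_f(results: list[tuple[float, float]]) -> float:
    """The provisional optimum coefficient S whose F value is highest."""
    return max(results, key=lambda pair: pair[1])[0]


# (provisional optimum coefficient S, F value) for each of the K = 2 groups.
results = [(0.6, 0.8), (0.65, 0.75)]

t_average = adjustment_by_average(results)
t_best = adjustment_by_best_f(results)
print(t_average, t_best)
```

For this example the highest-F rule gives T = 0.6, as stated in the text, while averaging would give approximately 0.625.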

In this embodiment, the provisional coefficient at which the F value is largest is used as the provisional optimum coefficient. However, the index is not limited to the F value: based on any index that indicates the degree of coincidence between the original class and the specified class of the words in the inspection group, the provisional coefficient at which that degree of coincidence is largest may be used as the provisional optimum coefficient S. For example, the provisional coefficient at which the number of words in the inspection group whose specified class matches their original class is largest may be used as the provisional optimum coefficient S. As another example, the provisional coefficient at which the number of words actually specified as the person name class, among the words designated as the person name class in the inspection group, is largest may be used as the provisional optimum coefficient.

In this embodiment, the learning corpus 200 is divided into K groups, each group is used in turn as the inspection group, and K provisional optimum coefficients S are obtained. Alternatively, the learning corpus 200 may be divided into two, with one part as the inspection group and the other as the corpus group, to obtain a single provisional optimum coefficient S. In this case, the provisional optimum coefficient S obtained may be used as the adjustment coefficient as it is. The inspection may be performed for every combination of the groups into which the learning corpus 200 is divided, but inspecting every combination is not an essential condition of the present invention. For example, when the learning corpus 200 is divided into three groups A, B and C, there are three possible combinations of (inspection group, corpus group): (A, B+C), (B, A+C) and (C, A+B). The inspection may nevertheless be performed for only two of these three combinations.

FIG. 8 is a flowchart showing the inspection process when the adjustment process is executed. Here too, the figure shows the processing for one word in an inspection target document taken as the inspection target word. The flowchart shown in the figure is therefore executed for each word in the inspection target document.
The processing from S70 to S78 is the same as the processing from S10 to S18 in FIG. 3, so its description is omitted. The fitness adjustment unit 128 adjusts the fitness with respect to the N class using the adjustment coefficient determined by the adjustment coefficient specifying unit 134 (S80). The class classification unit 132 specifies the class of the inspection target word with reference to the adjusted fitness (S82). The processing from S84 onward is the same as the processing from S22 onward in FIG. 3. In other words, the only thing added to the inspection process is, in effect, multiplying the fitness with respect to the N class by (2 − adjustment coefficient T).

FIG. 9 is a graph showing the verification result of the adjustment coefficient specifying algorithm described in this embodiment.
When the inventor executed the learning process on a predetermined learning corpus 200, the adjustment coefficient specifying unit 134 calculated an adjustment coefficient T = 0.912. Then, for an inspection target document whose correct classes were known in advance, the precision, the recall and the F value were actually calculated while varying the adjustment coefficient T, which produced the graph shown in the figure. In practice, the F value reaches its maximum near an adjustment coefficient T = 0.900. Although this differs slightly from the adjustment coefficient T = 0.912 specified by the adjustment coefficient specifying unit 134, the optimum adjustment coefficient T was obtained from the data of the learning corpus 200 with considerably high accuracy. In particular, it was confirmed that the F value can be improved markedly compared with the case where no adjustment is performed, that is, where the adjustment coefficient T = 1.0.

The present invention has been described above based on this embodiment.
The adjustment process performed by the document processing apparatus 100 improves the extraction accuracy of information extraction from document files based on a Bayesian algorithm. Specifically, it improves the accuracy with which unique words are extracted from an inspection target document. This embodiment has been described mainly with the aim of improving the extraction accuracy of unique words by broadly dividing classes into unique classes and the N class. However, by arbitrarily setting a predetermined class A and the classes other than it and obtaining the F value, the extraction accuracy of words belonging to class A can also be improved. The class to be adjusted is not limited to the N class; among the plurality of classes, the class into which the most words are classified in the learning corpus 200 may be the target of adjustment. In an ordinary learning corpus 200, however, the class into which the most words are classified is usually expected to be the N class.

Furthermore, an algorithm for automatically specifying the adjustment coefficient T based on the learning corpus 200 has been described. The document processing apparatus 100 can therefore automatically obtain, at the learning stage, the adjustment coefficient that maximizes the extraction accuracy.

Generating the processed inspection target document 212 from the pre-processing inspection target document 210 makes it easier to detect which classes of words are used in any given inspection target document. For example, a sentence that contains words in the order person name class, place name class, place name class, such as "I traveled from Kyoto to Nara", is relatively likely to be a sentence describing the movement of a person. In this way, information related to a predetermined event becomes easier to extract from the classes and the order in which they appear.
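
As a rough illustration of how the class order could be used, the sketch below checks a tagged sentence for the class sequence mentioned above. The tag representation, the function names and the tokenization are hypothetical; the class assigned to each word follows the example in the text.

```python
def class_sequence(tagged: list[tuple[str, str]], n_class: str = "N") -> list[str]:
    """Keep only the classes of unique words, in order of appearance."""
    return [cls for _, cls in tagged if cls != n_class]


def contains_pattern(sequence: list[str], pattern: list[str]) -> bool:
    """True if the pattern occurs as a contiguous run in the class sequence."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern for i in range(len(sequence) - n + 1))


# Hypothetical tagging of "I traveled from Kyoto to Nara".
tagged = [
    ("I", "person"), ("traveled", "N"), ("from", "N"),
    ("Kyoto", "place"), ("to", "N"), ("Nara", "place"),
]
movement_pattern = ["person", "place", "place"]
is_movement_candidate = contains_pattern(class_sequence(tagged), movement_pattern)
print(is_movement_candidate)
```

A match only marks the sentence as a candidate; as the text notes, such a sequence makes a movement sentence relatively likely, not certain.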

If the fitness is defined as fitness = log(precision) (with base 10), the precision is a value between 0 and 1, so the fitness is zero or negative. In this case, the adjustment may be calculated as post-adjustment fitness Y = pre-adjustment fitness X × adjustment coefficient T. That is, multiplying the pre-adjustment fitness X, which is a negative number, by an adjustment coefficient T of 1.0 or less reduces its absolute value, so the post-adjustment fitness Y becomes larger than the pre-adjustment fitness X.
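
The logarithmic variant can be checked numerically with a short sketch. The function names are hypothetical and the input value 0.01 is chosen only for illustration.

```python
import math


def log_fitness(rate: float) -> float:
    """Base-10 logarithm of a rate in (0, 1]; the result is zero or negative."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be greater than 0 and at most 1")
    return math.log10(rate)


def adjust_log_fitness(x: float, t: float) -> float:
    """Post-adjustment fitness Y = X * T for a log-scale fitness X."""
    return x * t


x = log_fitness(0.01)           # about -2.0
y = adjust_log_fitness(x, 0.9)  # about -1.8, closer to zero than X
print(x, y)
```

Because X is negative, multiplying by T below 1.0 moves it toward zero, which has the same direction of effect as multiplying a positive fitness by (2 − T).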

The present invention has been described above based on an embodiment. This embodiment is an example, and those skilled in the art will understand that various modifications are possible in the combinations of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention.

Those skilled in the art will also understand that the functions to be fulfilled by the constituent elements recited in the claims are realized by the individual functional blocks shown in this embodiment or by their cooperation.

FIG. 1 is a schematic diagram for explaining the outline of processing by the document processing apparatus.
FIG. 2 is a functional block diagram of the document processing apparatus.
FIG. 3 is a flowchart showing the inspection process when the adjustment process is not executed.
FIG. 4 is a graph showing the precision and the recall when the adjustment process is not executed.
FIG. 5 is a graph showing the precision and the recall when the adjustment coefficient T is changed.
FIG. 6 is a flowchart showing the process of calculating the fitness based on a plurality of adjustment coefficients in the learning process.
FIG. 7 is a flowchart showing the process of obtaining the provisional optimum coefficient S after the processing shown in FIG. 6 is completed.
FIG. 8 is a flowchart showing the inspection process when the adjustment process is executed.
FIG. 9 is a graph showing the verification result of the adjustment coefficient specifying algorithm described in this embodiment.

Explanation of symbols

100 document processing apparatus, 110 user interface processing unit, 112 input unit, 114 display unit, 116 document acquisition unit, 120 data processing unit, 122 inspection unit, 124 word extraction unit, 126 fitness calculation unit, 128 fitness adjustment unit, 130 data holding unit, 132 class classification unit, 134 adjustment coefficient specifying unit, 136 provisional optimum coefficient specifying unit, 170 class feature holding unit, 200 learning corpus, 210 pre-processing inspection target document, 212 processed inspection target document.

Claims

A class feature holding unit that classifies words into a plurality of classes and holds the word features in a predetermined corpus as class feature information for each class;
A document acquisition unit for acquiring a document to be inspected;
A word extraction unit for extracting words from the inspection target document;
A fitness calculation unit for calculating, for each of the plurality of classes, the fitness between the features of the extracted word in the inspection target document and the class feature information;
A fitness adjustment unit that adjusts the fitness calculated for a predetermined class of the plurality of classes;
A class classification unit that identifies a class corresponding to the extracted word based on the degree of matching of the extracted word with respect to each class;
A document processing apparatus comprising:

The document processing apparatus according to claim 1, wherein the predetermined corpus is a document group in which a class is designated in advance for a word in the document.

The document processing apparatus according to claim 2, wherein the predetermined class is a class in which words not explicitly specified in the predetermined corpus are classified.

The document processing apparatus according to claim 1, wherein the predetermined class is a class in which most words are classified in the predetermined corpus.

The document processing apparatus according to claim 1, wherein the fitness adjustment unit adjusts the fitness of the extracted word with respect to the predetermined class by executing, on the fitness with respect to the predetermined class, a predetermined calculation that uses an adjustment coefficient as a variable.

The document processing apparatus according to claim 5, further comprising an adjustment coefficient specifying unit for specifying the value of the adjustment coefficient, wherein,
when the adjustment coefficient is specified,
the fitness calculation unit divides the predetermined corpus into a corpus group and an inspection group, and calculates, for each of the plurality of classes, the fitness between the features of the words in the inspection group and the class feature information obtained from the corpus group;
the fitness adjustment unit adjusts the fitness of the words included in the inspection group with respect to the predetermined class by a provisional coefficient, which is a provisional adjustment coefficient;
the class classification unit specifies the class of the words included in the inspection group after the fitness with respect to the predetermined class has been adjusted by the provisional coefficient; and
the adjustment coefficient specifying unit specifies the adjustment coefficient from among a plurality of provisional coefficients according to the degree of coincidence between the specified class and the original class for the words included in the inspection group.

The document processing apparatus according to claim 6, wherein the adjustment coefficient specifying unit calculates the accuracy of class classification by comparing, among the words included in the inspection group, the words that belong to a class other than the predetermined class with the words specified as belonging to a class other than the predetermined class, and specifies, as the adjustment coefficient, the provisional coefficient at which the classification accuracy is highest among the plurality of provisional coefficients.

The document processing apparatus according to claim 7, wherein the adjustment coefficient specifying unit calculates an F value (F measure) obtained from a precision and a recall as a classification accuracy.

The document processing apparatus according to claim 6, wherein the adjustment coefficient specifying unit specifies, as the adjustment coefficient, the provisional coefficient at which, among the plurality of provisional coefficients, the number of words in the inspection group whose specified class matches their original class is largest.

The document processing apparatus according to claim 6, wherein the adjustment coefficient specifying unit divides the predetermined corpus into a predetermined number of groups, the number being 2 or more, sets one of them as the inspection group and the others as the corpus group, specifies a provisional optimum coefficient from among the plurality of provisional coefficients according to the degree of coincidence between the specified class and the original class for the words included in the inspection group, and determines the adjustment coefficient based on the predetermined number of provisional optimum coefficients obtained by setting each of the predetermined number of groups as the inspection group.

The document processing apparatus according to claim 10, wherein the adjustment coefficient specifying unit uses an average value of the predetermined number of provisional optimum coefficients as an adjustment coefficient.

The document processing apparatus according to claim 10, wherein the adjustment coefficient specifying unit determines, as the adjustment coefficient, the provisional optimum coefficient at which, among the predetermined number of provisional optimum coefficients, the degree of coincidence between the specified class and the original class for the words included in the inspection group is highest.

Obtaining a document to be inspected;
Extracting words from the inspection target document;
Calculating, for each of the plurality of classes, the fitness between the features of the extracted word in the inspection target document and the class feature information, with reference to class feature information in which words are classified into a plurality of classes and the features of words in a predetermined corpus are held for each class;
Adjusting the fitness calculated for a predetermined class of the plurality of classes;
Identifying a class corresponding to the extracted word based on a degree of fitness for each class of the extracted word;
A document processing method comprising:

A function that classifies words into a plurality of classes and holds the word features of a given corpus as class feature information for each class,
A function to obtain a document to be inspected;
A function of extracting words from the inspection target document;
A function of calculating, for each of the plurality of classes, the fitness between the features of the extracted word in the inspection target document and the class feature information;
A function of adjusting the degree of fitness calculated for a predetermined class of the plurality of classes;
A function for identifying a class corresponding to the extracted word based on the degree of matching of the extracted word with respect to each class;
A document processing program that causes a computer to implement the above functions.