JP2002530761A - Improved part-of-speech tagging method and apparatus

Info

Publication number: JP2002530761A
Application number: JP2000582999A
Authority: JP
Inventors: Alwin B. Carus
Original assignee: Lernout & Hauspie Speech Products N.V.; Alwin B. Carus
Priority date: 1998-11-17
Filing date: 1999-11-17
Publication date: 2002-09-17
Also published as: WO2000030070A2; CA2351404A1; EP1131812A2; WO2000030070A3; AU3789900A

Abstract

(57) [Abstract] A tagging apparatus for identifying the parts of speech of text includes a first part-of-speech tagger, which provides at a first output a part-of-speech tag for each term in the text, and a set of specialized part-of-speech taggers having an input and an output coupled to the apparatus output. The set of specialized part-of-speech taggers provides a set of candidate part-of-speech tags for each term supplied at its input. An exception handler coupled to the first output responds to each term in the text: if the term is not on an exception list, the handler passes the part-of-speech tag from the first output to the apparatus output; if the term is on the exception list, the handler supplies the term to the input of the specialized part-of-speech taggers. A voting procedure may be used to select one part-of-speech tag from the candidate tags that the specialized taggers generate for a term on the exception list.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field]

The present invention relates generally to part-of-speech tagging of text, and more particularly to disambiguating the contextual parts of speech of words and phrases in text.

[0002]

[Background Art]

Identifying the parts of speech of words and phrases in text is useful in many different fields: word processing and text processing (for example, proofreading), information retrieval and natural language database query, information and fact extraction, natural language understanding, and machine translation. Many different methods exist for identifying and tagging parts of speech, such as Markov models, decision trees, connectionism, transformations, nearest neighbor, online learning, and maximum entropy tagging. These methods are described in detail in the prior art. See, for example, Weischedel, R., Meteer, M., Schwartz, R., Ramshaw, L., and Palmucci, J., "Coping with Ambiguity and Unknown Words through Probabilistic Models," Computational Linguistics, 1993; Black, E., Jelinek, F., Lafferty, J., Mercer, R., and Roukos, S., "Decision Tree Models Applied to the Labeling of Text with Parts-of-Speech," DARPA Workshop on Speech and Natural Language (Harriman, N.Y., 1992); Schmid, H., "Part-of-Speech Tagging with Neural Networks," Proceedings of the 15th International Conference on Computational Linguistics (COLING) (Yokohama, Japan, 1994); Brill, E., "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging," Computational Linguistics 21(4), December 1995, pp. 543-565; Daelemans, W., Zavrel, J., Berck, P., and Gillis, S., "MBT: A Memory-Based Part-of-Speech Tagger Generator," Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14-27 (1996); and Ratnaparkhi, A., "A Maximum Entropy Part-of-Speech Tagger," Proceedings of the First Empirical Methods in Natural Language Processing Conference, May 17-18 (University of Pennsylvania, 1996). These documents are incorporated herein by reference.

[0003] Even the most accurate part-of-speech taggers of the prior art leave some residual errors. Improved efficiency and accuracy might be obtained from larger, slower part-of-speech taggers. Another approach developed to improve tagger accuracy is described in Brill, E. and Wu, J., "Classifier Combination for Improved Lexical Disambiguation," Proceedings of the 19th International Conference on Computational Linguistics and Association for Computational Linguistics (COLING-ACL) (Montreal, Canada, 1998), and in van Halteren, H., Zavrel, J., and Daelemans, W., "Improved Data Driven Word Class Tagging by System Combination," Proceedings of the 19th International Conference on Computational Linguistics and Association for Computational Linguistics (COLING-ACL) (Montreal, Canada, 1998), pp. 491-497. These documents are incorporated herein by reference. The method described in these documents processes the entire text with four different part-of-speech taggers. A part-of-speech tag is then selected from the results of the four taggers using a selection procedure. Such a method improves accuracy, but at a cost in computational speed and complexity.

[0004]

DISCLOSURE OF THE INVENTION

According to one aspect of the invention, a tagging apparatus for identifying the parts of speech of terms in text includes a first part-of-speech tagger, a set of specialized part-of-speech taggers, and an exception handler. As used in this specification and the claims, the word "set" means a set containing at least one member. The first part-of-speech tagger provides, at a first output, a part-of-speech tag for each term in the text. As used in this specification and the claims, "term" means a word, and optionally a word or a phrase. In other words, the tagging apparatus operates on each word in the text, and optionally the taggers operate on phrases in the text as well. The set of specialized part-of-speech taggers has an input and an output coupled to the apparatus output, and provides a set of candidate part-of-speech tags for each term supplied at that input. The exception handler, coupled to the first output, responds to each term in the text: if the term is not on an exception list, the handler passes the part-of-speech tag from the first output to the apparatus output; if the term is on the exception list, the handler supplies the term to the set of specialized part-of-speech taggers.

[0005] In another aspect, the set of specialized part-of-speech taggers includes a plurality of specialized part-of-speech taggers, and the tagging apparatus further includes a selector coupled to the outputs of the specialized taggers. The selector also has an output coupled to the apparatus output. The selector uses a voting procedure to select a part-of-speech tag from the set of candidate part-of-speech tags and provides the selected tag at the apparatus output.

[0006] In yet another aspect, at least one member of the set of specialized part-of-speech taggers is optimized to process terms on the exception list. The exception list may include the terms responsible for a predetermined percentage of the errors produced by the first part-of-speech tagger.

[0007] In yet another aspect, the voting procedure generates a score for each unique candidate part-of-speech tag in the set of candidate part-of-speech tags, based on a predetermined characteristic of each specialized part-of-speech tagger in the set of specialized taggers. The voting procedure may select the part-of-speech tag with the highest score. The tagging apparatus may further include a tokenizer, coupled to the first part-of-speech tagger, that parses the text into a set of word tokens.

[0008] In an alternative aspect, a method of identifying the parts of speech of terms in text includes the steps of (a) determining a part-of-speech tag for each term in the text using a first part-of-speech tagger; (b) identifying each term in the text that is included in an exception list; (c) providing the part-of-speech tag from step (a) as the apparatus output for each term not included in the exception list; and (d) determining a set of candidate part-of-speech tags for each term included in the exception list using a set of specialized part-of-speech taggers. In yet another aspect, where the set of specialized part-of-speech taggers includes a plurality of taggers, the method further includes the steps of (e) selecting a part-of-speech tag from the set of candidate part-of-speech tags using a voting procedure; and (f) providing the part-of-speech tag selected in step (e) as the apparatus output for each term included in the exception list.

[0009] In yet another aspect, at least one member of the set of specialized part-of-speech taggers is optimized to process terms on the exception list. The exception list may include the terms responsible for a predetermined percentage of the errors produced by step (a). In the above aspects, the voting procedure generates a score for each unique candidate part-of-speech tag in the set of candidate part-of-speech tags, based on a predetermined characteristic of each specialized part-of-speech tagger in the set of specialized taggers. The voting procedure may select the candidate part-of-speech tag with the highest score. The method may further include, before step (a), parsing the text into word tokens.

[0010] In another alternative aspect, a digital storage medium encoded with instructions may, when loaded into a computer, establish any of the apparatus discussed above.

[0011]

DETAILED DESCRIPTION OF THE INVENTION

The invention will be more readily understood by reference to the following detailed description, taken together with the accompanying drawings.

[0012] FIG. 1 shows a block diagram of a tagging apparatus according to an embodiment of the invention. Text is entered at text input 10 and is then parsed into word tokens by tokenizer 11. Tokenizer 11 may be any of those used in the prior art (for example, U.S. Pat. No. 5,721,939, "Method and apparatus for tokenizing text," or U.S. Pat. No. 4,991,094, "Method for language-independent text tokenization using a character categorization"). The contents of these patents are incorporated herein by reference. The tokenized text is then placed in a text buffer for processing by the tagging apparatus. The first part-of-speech tagger 12 processes the tokenized text. The first part-of-speech tagger 12 may be one of those commonly used in the art, such as a Markov model, decision tree, connectionist, transformation, nearest neighbor, online learning, or maximum entropy tagger. The first part-of-speech tagger is preferably fast and accurate. In one embodiment of the invention, the first part-of-speech tagger 12 is a Brill transformation tagger implemented as an Abney-like finite-state automaton (FSA) (see Brill, E., "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging," Computational Linguistics 21(4), December 1995, pp. 543-565). This document is incorporated herein by reference. Accordingly, the first part-of-speech tagger 12 is created by generating Brill part-of-speech tagging transformation rules against a first part-of-speech-tagged corpus of text.

[0013] Exception handler 13 is coupled to the first part-of-speech tagger 12. If the term being processed is not found on the exception list, the part-of-speech tag identified by the first part-of-speech tagger becomes the output 19 of the tagging apparatus. When exception handler 13 encounters a term that is found on the exception list, the term is sent for further processing to the set of specialized part-of-speech taggers 14-17 coupled to exception handler 13. The set of specialized part-of-speech taggers may include n taggers, where n is a number greater than or equal to one. In the embodiment of FIG. 1, the set has four specialized part-of-speech taggers 14-17.

[0014] Preferably, the exception list contains terms known to be tagged incorrectly by the first part-of-speech tagger 12. The terms on the exception list are identified by running the first part-of-speech tagger 12 over a second part-of-speech-tagged corpus of text and identifying the residual errors of the first tagger. A frequency distribution of tagging errors, keyed by the term associated with each error, is generated in order to identify the terms responsible for the most frequently occurring errors produced by the first tagger. The terms responsible for a predetermined percentage of the errors produced by the first part-of-speech tagger 12 are included in the exception list. In one embodiment, the predetermined percentage is 90%.
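The exception-list construction can be read as a cumulative-coverage cut over per-term error counts. The Python sketch below illustrates that reading only; it is not the patent's implementation. The function name `build_exception_list`, its parameters, and the rule of stopping as soon as the selected terms cover the target share of errors are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def build_exception_list(
    gold: Sequence[Tuple[str, str]],  # (term, correct tag) pairs from the second tagged corpus
    predicted: Sequence[str],         # the first tagger's tag at each position
    coverage: float = 0.90,           # predetermined share of errors to account for
) -> List[str]:
    """Return the most error-prone terms that together cover `coverage` of all errors."""
    # Frequency distribution of tagging errors, keyed by the mis-tagged term.
    errors = Counter(
        term for (term, gold_tag), tag in zip(gold, predicted) if tag != gold_tag
    )
    total = sum(errors.values())
    selected: List[str] = []
    covered = 0
    # Walk the terms from the most to the least frequent source of errors.
    for term, count in errors.most_common():
        if covered >= coverage * total:
            break
        selected.append(term)
        covered += count
    return selected
```

A small `coverage` keeps the list short and the specialized taggers cheap to run; a larger value routes more terms through the slower path.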

[0015] Each specialized part-of-speech tagger is generated using the exception list described above. Each specialized part-of-speech tagger 14-17 may be of a kind generally known in the art. As discussed above, some examples of part-of-speech taggers are Markov model, decision tree, connectionist, transformation, nearest neighbor, online learning, and maximum entropy taggers. Each specialized tagger is generated by the method appropriate to its style of tagger, but each is trained specifically on the terms included in the exception list. Preferably, each specialized part-of-speech tagger is of a different type. In one embodiment, the specialized part-of-speech taggers 14-17 are a trigram tagger, a Brill transformation tagger, a memory-based learning tagger, and a maximum entropy tagger.

[0016] If the set of specialized part-of-speech taggers contains a single specialized tagger, the output of that tagger is the apparatus output 19 for each term in the text that is included in the exception list. As discussed above, for each term not found on the exception list, the apparatus output 19 is the output of the first part-of-speech tagger 12.

[0017] If the set of specialized part-of-speech taggers consists of a plurality of specialized taggers, as shown in FIG. 1, each specialized part-of-speech tagger 14-17 generates one candidate part-of-speech tag for a term processed by the set. Each candidate part-of-speech tag generated by the set of specialized taggers is provided to selector 18. Selector 18 uses a voting procedure to select one of the candidate part-of-speech tags. FIG. 2 is a block diagram showing a voting procedure according to an embodiment of the invention. At block 12, each specialized part-of-speech tagger processes the term to identify a candidate part of speech. At block 21, the voting procedure creates a list of the unique candidate part-of-speech tags identified by the specialized taggers. Next, at block 22, a score is calculated for each unique candidate part-of-speech tag.

[0018] In one embodiment, the voting procedure calculates a score for each unique candidate part-of-speech tag generated by the set of specialized part-of-speech taggers, using values of precision and recall computed in advance for each specialized tagger (block 22). Precision is the percentage of tokens tagged X by the part-of-speech tagger that are also tagged X in the training corpus. Recall is defined as the percentage of tokens tagged X in the training corpus that are also tagged X by the part-of-speech tagger. For example, the word "that" has several parts of speech, such as subordinating conjunction (CS), determiner (DT), qualifier (QL), or WH pronoun (WPR). If a specialized part-of-speech tagger generates the tag DT for 50 occurrences of the word "that," and 45 of them are identified as correct by the training corpus, the precision is 0.90 (= 45/50). If the training corpus contains 50 occurrences of "that" tagged DT, and the specialized tagger tags 48 of them DT, the recall is 0.96 (= 48/50).
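The two per-tagger statistics defined above (precision: the share of a tagger's X tags that the training corpus confirms; recall: the share of the corpus's X tags that the tagger finds) can be computed from aligned tag sequences. The following Python sketch is a minimal illustration; the function name `precision_recall` and the convention of returning 0.0 when a denominator is zero are assumptions, not details from the source.

```python
from typing import Sequence, Tuple


def precision_recall(
    gold_tags: Sequence[str],       # tags from the training corpus
    predicted_tags: Sequence[str],  # tags produced by one specialized tagger
    tag: str,                       # the tag X being evaluated, e.g. "DT"
) -> Tuple[float, float]:
    """Return (precision, recall) of `tag` for one tagger on aligned tag sequences."""
    tagged_by_tagger = sum(1 for p in predicted_tags if p == tag)
    tagged_in_corpus = sum(1 for g in gold_tags if g == tag)
    # Tokens tagged X by the tagger and also tagged X in the corpus.
    agreed = sum(1 for g, p in zip(gold_tags, predicted_tags) if g == tag and p == tag)
    precision = agreed / tagged_by_tagger if tagged_by_tagger else 0.0
    recall = agreed / tagged_in_corpus if tagged_in_corpus else 0.0
    return precision, recall
```

With 50 tokens tagged DT by the tagger, of which 45 are DT in the corpus, the first returned value is 45/50; with 50 DT tokens in the corpus, of which the tagger finds 48, the second is 48/50.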

[0019] As explained above, precision and recall can be used at block 22 to determine the score for each unique candidate part-of-speech tag. The score for a candidate part-of-speech tag can be calculated by adding the precision of each specialized part-of-speech tagger that generated that particular candidate tag to a quantity equal to (1 - recall) for each specialized part-of-speech tagger that did not generate that particular candidate tag. The candidate part-of-speech tag with the highest accumulated score is selected at block 23 as the part-of-speech tag for the term processed by the set of specialized part-of-speech taggers.

[0020] Table 1 shows example results for the word "that" using a set of specialized part-of-speech taggers consisting of a trigram tagger, a Brill transformation tagger, a memory-based learning tagger, and a maximum entropy tagger. The candidate part-of-speech tags are determiner (DT) and subordinating conjunction (CS).

[0021] Table 1. Example results of the specialized taggers for the term "that" in an arbitrary case

[Table 1]

[0022] The scores for the candidate part-of-speech tags "DT" and "CS" are calculated as follows.

Score_DT = 0.83 + 0.88 + (1 - 0.93) + (1 - 0.89) = 1.89

Score_CS = 0.87 + 0.91 + (1 - 0.87) + (1 - 0.93) = 1.98

In this example, the candidate part-of-speech tag CS has the higher score and is selected as the part-of-speech tag for the word "that".
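The score calculation can be sketched in Python as follows. This is an illustrative reading, not the patent's code: it assumes that a tagger adds its precision to the tag it proposed and adds (1 - recall) to every competing candidate tag. The names `Vote`, `score_candidates`, and `select_tag` are hypothetical, and because the contents of Table 1 are not reproduced in the text, the pairing of the published numbers with individual taggers in the usage note is also an assumption.

```python
from typing import Dict, List, NamedTuple


class Vote(NamedTuple):
    tag: str          # candidate tag proposed by one specialized tagger
    precision: float  # precomputed precision, credited to the proposed tag
    recall: float     # precomputed recall; (1 - recall) is credited to competing tags


def score_candidates(votes: List[Vote]) -> Dict[str, float]:
    """Accumulate a score for each unique candidate tag."""
    # Unique candidate tags in first-seen order.
    candidates = list(dict.fromkeys(v.tag for v in votes))
    scores: Dict[str, float] = {}
    for candidate in candidates:
        total = 0.0
        for v in votes:
            # A tagger that proposed this tag contributes its precision;
            # a tagger that proposed another tag contributes (1 - recall).
            total += v.precision if v.tag == candidate else 1.0 - v.recall
        scores[candidate] = total
    return scores


def select_tag(votes: List[Vote]) -> str:
    """Return the candidate tag with the highest accumulated score."""
    scores = score_candidates(votes)
    return max(scores, key=lambda tag: scores[tag])
```

Calling `score_candidates` on `[Vote("DT", 0.83, 0.87), Vote("DT", 0.88, 0.93), Vote("CS", 0.87, 0.93), Vote("CS", 0.91, 0.89)]` reproduces the two sums given in the text (1.89 for DT and 1.98 for CS, up to floating-point rounding), and `select_tag` returns "CS".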

[0023] Returning to FIG. 1, the output 19 of the tagging apparatus is the output of selector 18 for each term in the text that is found on the exception list. Otherwise, the output 19 of the tagging apparatus is the output of the first part-of-speech tagger 12. Using the specialized part-of-speech taggers 14-17 in combination with the first part-of-speech tagger 12 improves on the performance and accuracy of the first tagger alone. This is achieved by training each specialized part-of-speech tagger 14-17 to improve accuracy on the terms that produce the highest error rates for the first part-of-speech tagger 12.

[0024] FIG. 3 shows the control flow of a method of identifying the parts of speech of terms in text according to an embodiment of the invention. Text input at block 30 is parsed into word tokens at block 31. The tokenized text is then placed in a text buffer at block 32 and processed by the first part-of-speech tagger at block 33. At block 34, if the term being processed is not found on the exception list, the output at block 37 is the part-of-speech tag generated by the first part-of-speech tagger. The exception list is described above with respect to FIG. 1. If the term being processed is found on the exception list, the term is processed at block 35 by the specialized part-of-speech taggers. The set of specialized part-of-speech taggers generates a set of candidate part-of-speech tags. If the set contains only one specialized part-of-speech tagger, the output at block 37 is the output of that tagger as determined at block 35. If the set contains a plurality of specialized part-of-speech taggers, a voting procedure is used at block 36 to select one part-of-speech tag from the set of candidate tags. The voting procedure of an embodiment of the invention is described above with respect to FIG. 2. At block 37, the output for a term in the text that is found on the exception list is the part-of-speech tag selected at block 36.
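The control flow of FIG. 3 can be summarized in a short Python sketch. It is a hedged illustration rather than the disclosed implementation: `tag_text`, the `Tagger` and `Selector` type aliases, whitespace splitting as a stand-in tokenizer, and case-insensitive lookup in the exception list are all assumptions introduced here. Block numbers in the comments refer to the description of FIG. 3.

```python
from typing import Callable, List, Sequence, Set, Tuple

Tagger = Callable[[Sequence[str], int], str]  # (tokens, position) -> tag
Selector = Callable[[List[str]], str]         # candidate tags -> selected tag


def tag_text(
    text: str,
    first_tagger: Tagger,
    exception_list: Set[str],
    specialized: Sequence[Tagger],
    select: Selector,
) -> List[Tuple[str, str]]:
    """Tag each term, routing exception-list terms to the specialized taggers."""
    if not specialized:
        raise ValueError("the set of specialized taggers needs at least one member")
    tokens = text.split()  # blocks 30-32: stand-in for a real tokenizer and buffer
    output: List[Tuple[str, str]] = []
    for i, term in enumerate(tokens):
        first_tag = first_tagger(tokens, i)                # block 33
        if term.lower() not in exception_list:             # block 34
            output.append((term, first_tag))               # block 37
            continue
        candidates = [t(tokens, i) for t in specialized]   # block 35
        if len(candidates) == 1:
            output.append((term, candidates[0]))           # single tagger: no vote needed
        else:
            output.append((term, select(candidates)))      # block 36, then block 37
    return output
```

Only terms on the exception list pay the cost of running several taggers and a vote, which is the point of contrast with combining four taggers over the entire text.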

[0025] Although various exemplary embodiments of the invention have been described, it will be apparent to those skilled in the art that various changes and modifications can be made that achieve some of the advantages of the invention without departing from its true scope. These and other obvious modifications are intended to be covered by the appended claims.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of a tagging apparatus according to an embodiment of the invention.

FIG. 2 is a block diagram showing the voting procedure used by the tagging apparatus of FIG. 1 according to an embodiment of the invention.

FIG. 3 is a block diagram showing the flow of a part-of-speech tagging method according to an embodiment of the invention.

Continued from front page
(72) Inventor: Alwin B. Carus, 20 East Quinobequin Road, Burlington, Massachusetts 02468, United States
F-term (reference): 5B091 AA15 CA02 CC02 CC15

Claims

[Claims]

1. A tagging apparatus for identifying the parts of speech of text, comprising: a first part-of-speech tagger that provides, at a first output, a part-of-speech tag for each term in the text; a set of specialized part-of-speech taggers having an input and an output coupled to an apparatus output, the set providing a set of candidate part-of-speech tags for each term supplied at the input; and an exception handler coupled to the first output that, in response to each term in the text, provides the part-of-speech tag from the first output to the apparatus output if the term is not included in an exception list, and supplies the term to the input of the specialized part-of-speech taggers if the term is included in the exception list.

2. The apparatus of claim 1, wherein the set of specialized part-of-speech taggers includes a plurality of specialized part-of-speech taggers, the apparatus further comprising a selector coupled to the output of the set of specialized part-of-speech taggers and having an output coupled to the apparatus output, the selector selecting a part-of-speech tag from the set of candidate part-of-speech tags using a voting procedure and providing the selected part-of-speech tag at the apparatus output.

3. The apparatus of claim 1, wherein at least one member of the set of specialized part-of-speech taggers is optimized to process terms on the exception list.

4. The apparatus of claim 1, wherein the exception list comprises terms that cause a predetermined percentage of the errors generated by the first part-of-speech tagger.

5. The apparatus of claim 2, wherein the voting procedure generates a score for each unique candidate part-of-speech tag in the set of candidate part-of-speech tags based on a predetermined characteristic of each specialized part-of-speech tagger in the set of specialized part-of-speech taggers.

6. The apparatus of claim 5, wherein the voting procedure selects the candidate part-of-speech tag with the highest score.

7. The apparatus of claim 1, further comprising a tokenizer coupled to the first part-of-speech tagger that parses the text into a set of word tokens.

8. A method of identifying the parts of speech of text, comprising: (a) determining a part-of-speech tag for each term in the text using a first part-of-speech tagger; (b) identifying each term in the text that is included in an exception list; (c) providing the part-of-speech tag from step (a) as an apparatus output for each term not included in the exception list; and (d) determining a set of candidate part-of-speech tags for each term in the text that is included in the exception list, using a set of specialized part-of-speech taggers.

9. The method of claim 8, wherein the set of specialized part-of-speech taggers includes a plurality of taggers, the method further comprising: (e) selecting one part-of-speech tag from the set of candidate part-of-speech tags using a voting procedure; and (f) providing the part-of-speech tag selected in step (e) as the apparatus output for each term in the text that is included in the exception list.

10. The method of claim 8, wherein at least one member of the set of specialized part-of-speech taggers is optimized to process terms on the exception list.

11. The method of claim 8, wherein the exception list comprises terms that cause a predetermined percentage of the errors generated by step (a).

12. The method of claim 8, wherein the voting procedure generates a score for each unique candidate part-of-speech tag in the set of candidate part-of-speech tags based on a predetermined characteristic of each specialized part-of-speech tagger in the set of specialized part-of-speech taggers.

13. The method of claim 8, wherein the voting procedure selects the part-of-speech tag with the highest score.

14. The method of claim 8, further comprising parsing the text into word tokens before step (a).

15. A digital storage medium encoded with instructions that, when loaded into a computer, establish an apparatus for identifying the parts of speech of text, the apparatus comprising: a first part-of-speech tagger that provides, at a first output, a part-of-speech tag for each term in the text; a set of specialized part-of-speech taggers having an input and an output coupled to an apparatus output, the set providing a set of candidate part-of-speech tags for each term supplied at the input; and an exception handler coupled to the first output that, in response to each term in the text, provides the part-of-speech tag from the first output to the apparatus output if the term is not included in an exception list, and supplies the term to the input of the specialized part-of-speech taggers if the term is included in the exception list.

16. The storage medium of claim 15, wherein the set of specialized part-of-speech taggers includes a plurality of specialized part-of-speech taggers, the apparatus further comprising a selector coupled to the output of the set of specialized part-of-speech taggers and having an output coupled to the apparatus output, the selector selecting a part-of-speech tag from the set of candidate part-of-speech tags using a voting procedure and providing the selected part-of-speech tag at the apparatus output.

17. The storage medium of claim 15, wherein at least one member of the set of specialized part-of-speech taggers is optimized to process terms on the exception list.

18. The storage medium of claim 15, wherein the exception list comprises terms that cause a predetermined percentage of the errors generated by the first part-of-speech tagger.

19. The storage medium of claim 16, wherein the voting procedure generates a score for each unique candidate part-of-speech tag in the set of candidate part-of-speech tags based on a predetermined characteristic of each specialized part-of-speech tagger in the set of specialized part-of-speech taggers.

20. The storage medium of claim 19, wherein the voting procedure selects the candidate part-of-speech tag with the highest score.

21. The storage medium of claim 15, the apparatus further comprising a tokenizer coupled to the first part-of-speech tagger that parses the text into a set of word tokens.