JP2010250814A

JP2010250814A - Part-of-speech tagging system, training device and method of part-of-speech tagging model

Info

Publication number: JP2010250814A
Application number: JP2010077274A
Authority: JP
Inventors: Changjian Hu; チェンジエンフー; Zhao Kai; カイザオ; Likun Qiu; リクンチュ; Guoyang Shen; ゴゥヨンセン
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2009-04-14
Filing date: 2010-03-30
Publication date: 2010-11-04
Anticipated expiration: 2030-03-30
Also published as: CN101866337A; CN101866337B; JP5128629B2

Abstract

PROBLEM TO BE SOLVED: To provide a part-of-speech tagging system, which enables part-of-speech tagging for a large tagging set and enhances part-of-speech tagging accuracy. SOLUTION: The part-of-speech tagging system 1 comprises a part-of-speech tagging model training device 12 which trains a part-of-speech tagging model for each node and each hierarchy based on a part-of-speech hierarchy tree 15 by using a text with a first tag in the part-of-speech tag training set 10, and a part-of-speech tagging device 22 which implements part-of-speech tagging for the text to be tagged by using a trained part-of-speech tagging model 13. COPYRIGHT: (C)2011,JPO&INPIT

Description

本発明は自然言語処理分野に関し、特に、品詞タグ付けシステムと品詞タグ付けモデルのトレーニング装置およびその方法に関する。 The present invention relates to the field of natural language processing, and in particular, to a part-of-speech tagging system and a part-of-speech tagging model training apparatus and method.

コンピュータ処理が可能な自然言語テキストの数は、インターネットの普及と情報化社会の進展に伴い大幅に増加している。そして、それに呼応するように、テキスト抽出、情報抽出、言語間情報処理、マンマシン対話といった大量情報を扱うアプリケーションに対する需要も急速に高まってきている。自然言語処理技術は、上記の需要に取り組む主要技術の１つである。「品詞タグ付け」とは、テキスト内の各語の正しい品詞をタグ付けすることであり、自然言語処理の基盤となるものである。品詞タグ付けの結果は通常、自然言語処理のうち、より高いレベルの処理（語の頻度の統計分析、構文、チャンク、意味解析等）に直接影響する。そのため、高効率かつ高精度な品詞タグ付け方法およびシステムを実現することがきわめて重要である。 The number of natural language texts that can be processed by computers has increased significantly with the spread of the Internet and the development of an information society. Correspondingly, the demand for applications that handle large amounts of information such as text extraction, information extraction, information processing between languages, and man-machine dialogue is also increasing rapidly. Natural language processing technology is one of the main technologies that address the above demands. “Part-of-speech tagging” is to tag the correct part-of-speech for each word in the text, which is the basis for natural language processing. The result of part-of-speech tagging typically directly affects higher level processing (word frequency statistical analysis, syntax, chunks, semantic analysis, etc.) of natural language processing. Therefore, it is extremely important to realize a method and system for tagging parts of speech with high efficiency and high accuracy.

In natural language processing, part-of-speech tagging is a sequence tagging problem. Conditional random fields (CRFs) have been widely used to address sequence tagging problems in natural language processing. A CRF is essentially an undirected graphical model for computing conditional probabilities, that is, the probability of the output node values given the input node values. CRFs can represent features such as long-distance dependencies and overlapping features, and can be used in information extraction tasks with strong global dependencies. Using a CRF avoids the strong independence assumptions made by directed graphical models such as maximum entropy (ME) models and hidden Markov models (HMMs), and thereby solves the label bias problem that arises in directed graphical models. For these reasons, the CRF is regarded as one of the best statistical learning models for sequence tagging problems. To obtain an effective part-of-speech tagging model, it is necessary to introduce a large number of features and to train with a large tag set. However, CRF training requires enormous time and computational resources, and both increase geometrically as the number of tags increases. The CRF model is therefore rarely applied to large-scale system applications that handle large tag sets (for example, part-of-speech tagging systems); it is applied mainly where a small number of features and a small training corpus are used. Given the relatively high accuracy required of part-of-speech tagging, finding a way to apply the CRF model to part-of-speech tagging with a large tag set and a large feature corpus is an urgent task.

Several solutions to the above problem have already been proposed. For example, Non-Patent Document 1 (Cohn T, Smith A, Osborne M. Scaling conditional random fields using error-correcting codes. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan: Association for Computational Linguistics, June 2005, pp. 10-17.) proposes a method of applying CRF to a large tag set. To address the problem of CRF training with a large tag set, Non-Patent Document 1 introduces ECOC (error-correcting output codes), an ensemble method that consists of an encoding process, which defines redundant decision functions, and a decoding process, which constructs the final classification function from those decision functions. The details of this method are as follows.

Model training phase (encoding phase)
1) Assume that the tag set has m labels (for example, NN for nouns, VB for verbs, JJ for adjectives, and RB for adverbs) and that an ECOC of length n has been selected manually. The error-correcting code is used to map each label to an n-bit vector, as shown in the following example.

Then, using the above encoding, the original tagging problem (also called a “multi-class classification problem”) is converted into n independent binary classification problems, with each column of the code corresponding to one binary classifier. Taking the third classifier (the column marked by the black box) as an example, this classifier is used to distinguish words tagged “NN” or “JJ” from words tagged “VB” or “RB”.

2) Build a training corpus for each of these binary classifiers. Each corpus can be built by modifying the original corpus: the tag labels in the training corpus are simply replaced with the corresponding code values. For example, to build the corpus for the third classifier, it is enough to replace every “NN” and “JJ” in the original corpus with “1” and every “VB” and “RB” with “0”. Once the modified corpora have been obtained, the binary classifiers are trained using a conventional CRF training method.
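The relabelling in step 2 can be sketched as follows. This is a minimal illustration only: the code table from the original publication is not reproduced here, so the 5-bit matrix `CODES` is hypothetical, and only its third column (NN and JJ map to 1, VB and RB map to 0) follows the example in the text. The function name `binary_corpus` and the toy sentence are likewise assumptions.

```python
# Hypothetical 4-label, 5-bit code matrix. Only the third column (index 2)
# follows the text's example (NN, JJ -> 1; VB, RB -> 0); the rest is made up.
CODES = {
    "NN": (1, 0, 1, 0, 1),
    "VB": (0, 1, 0, 0, 1),
    "JJ": (1, 1, 1, 1, 0),
    "RB": (0, 0, 0, 1, 0),
}


def binary_corpus(corpus, column):
    """Replace each tag with the bit that the chosen code column assigns to it."""
    return [[(word, CODES[tag][column]) for word, tag in sentence]
            for sentence in corpus]


# Toy tagged corpus: one sentence of (word, tag) pairs.
corpus = [[("NEC", "NN"), ("develops", "VB"), ("leading", "JJ"), ("quickly", "RB")]]

# Corpus for the third binary classifier.
third = binary_corpus(corpus, 2)
```

Each of the n relabelled corpora would then be passed to an ordinary CRF trainer, which is why the training cost still scales with the code length n.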

Model usage phase (decoding phase)
1) A sentence is given (for example, “NEC develops world-leading technology to prevent IP phone spam”).
2) The sentence is tagged with each of the binary classifiers trained above, and the results are recorded. The result should be as follows.

As shown in the table above, each word has one corresponding n-bit vector. In some conventional schemes, each vector is compared one by one with the code vectors in Table 3, and the matching label is used for tagging. For example, the n-bit vector for the word “develops” is closest to the code for “VB”, so the word is tagged “VB” (verb).
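The comparison step can be sketched as nearest-codeword decoding. The text only says that the closest code is chosen; using Hamming distance as the measure of closeness is an assumption here, and the code matrix `CODES` is the same hypothetical one as before, not the table from the publication.

```python
# Hypothetical code matrix (not the publication's table).
CODES = {
    "NN": (1, 0, 1, 0, 1),
    "VB": (0, 1, 0, 0, 1),
    "JJ": (1, 1, 1, 1, 0),
    "RB": (0, 0, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bits):
    """Return the label whose codeword is nearest to the predicted bit vector."""
    return min(CODES, key=lambda label: hamming(CODES[label], bits))


# Outputs of the five binary classifiers for one word: the VB codeword
# with its fourth bit flipped by a classifier error.
predicted = (0, 1, 0, 1, 1)
label = decode(predicted)
```

Because every word requires one pass through all n classifiers plus this comparison against every codeword, decoding cost grows with both the code length and the number of labels.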

At present, known techniques cannot efficiently apply CRF to part-of-speech tagging with a large tag set, and the above method remains far from practical use for the following reasons.
1) The performance of the method of Non-Patent Document 1 depends heavily on the choice of ECOC, but selecting an ideal ECOC is difficult.
2) The method does not fundamentally solve the problem that training takes too long and depends heavily on expensive computational resources. The training phase requires n binary classifiers to be trained, and the value of n depends on the choice of ECOC. In part-of-speech tagging, n is generally quite large, so training still takes a long time and still depends on expensive computational resources. Furthermore, the decoding phase must apply all the binary classifiers one by one, and the code matching process is very cumbersome. Using the trained model therefore also takes a long time and requires expensive computational resources.

Cohn T, Smith A, Osborne M. Scaling conditional random fields using error-correcting codes. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan: Association for Computational Linguistics, June 2005, pp. 10-17.

The present invention solves the problem of applying a conventional CRF to part-of-speech tagging with a large tag set by introducing a part-of-speech hierarchy together with a technique for splitting and combining cascaded CRFs. The present invention automatically analyzes the internal relationships between different parts of speech in the training set and builds a part-of-speech hierarchy tree based on those relationships to organize all parts of speech. Based on this tree, the present invention introduces cascaded CRF models, which reduce the number of tags handled at each level, and specifies how the individual models depend on one another. As a result, a cascaded CRF part-of-speech tagging model can be trained and obtained automatically even for a large tag set. In addition, because the training set may be sparse, the present invention trains a part-of-speech guessing model for unknown words based on word-formation rules, which further improves part-of-speech tagging accuracy.

According to a first aspect of the present invention, there is provided a part-of-speech tagging system comprising: a part-of-speech tagging model training device that, based on a part-of-speech hierarchy tree, trains a part-of-speech tagging model for each hierarchy level and each node using first tagged text in a part-of-speech tag training set; and a part-of-speech tagging device that tags the parts of speech of text to be tagged using the trained part-of-speech tagging model.

According to a second aspect of the present invention, there is provided a part-of-speech tagging method comprising: a part-of-speech tagging model training step of training, based on a part-of-speech hierarchy tree, a part-of-speech tagging model for each hierarchy level and each node using first tagged text in a part-of-speech tag training set; and a part-of-speech tagging step of tagging the parts of speech of text to be tagged using the trained part-of-speech tagging model.

According to a third aspect of the present invention, there is provided a training device for a part-of-speech tagging model, comprising: a CRF model training corpus construction unit that constructs a CRF model training corpus by tagging, for each hierarchy level and each node and based on a part-of-speech hierarchy tree, first tagged text in a part-of-speech tag training set as second text; and a CRF model training unit that trains an individual CRF model for each hierarchy level and each node, using the second text tagged by the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.

According to a fourth aspect of the present invention, there is provided a training method for a part-of-speech tagging model, comprising: a CRF model training corpus construction step of constructing a CRF model training corpus by tagging, for each hierarchy level and each node and based on a part-of-speech hierarchy tree, first tagged text in a part-of-speech tag training set as second text; and a CRF model training step of training an individual CRF model for each hierarchy level and each node, using the second text tagged in the CRF model training corpus construction step, to obtain the part-of-speech tagging model.

The present invention fundamentally solves the problem of applying CRF to part-of-speech tagging with a large tag set, as follows.
1) The present invention makes it possible to apply the CRF model to part-of-speech tagging with a large tag set, and solves the problem that training takes a long time and depends heavily on expensive computational resources. With the method and system of the present invention, a part-of-speech tagging model can be trained on any ordinary PC, regardless of model.
2) Part-of-speech tagging accuracy improves for two reasons. (i) Part-of-speech sequence tagging is a task with strong global dependencies, and introducing the CRF model allows global optimization to be performed efficiently. (ii) Introducing a part-of-speech guessing mechanism for unknown words, based on word-formation rules, addresses the problem of a sparse training set and improves overall tagging accuracy.
3) Because the method of the present invention is fully automated, the labor cost of training and optimizing part-of-speech tagging models can be reduced significantly.

A schematic diagram of a part-of-speech tagging system according to a first embodiment of the present invention.
A flowchart of a part-of-speech tagging method according to the first embodiment of the present invention.
A schematic diagram of a part-of-speech hierarchy tree construction device according to the present invention.
A flowchart of a part-of-speech hierarchy tree construction method according to the present invention.
A configuration example of a part-of-speech hierarchy tree.
An example of the data structure of a part-of-speech hierarchy tree.
An example of the data structure of a part-of-speech hierarchy tree.
A schematic block diagram of a part-of-speech tagging model training device according to the present invention.
A flowchart of a part-of-speech tagging model training method according to the present invention.
A schematic diagram of a part-of-speech tagging device according to the present invention.
A flowchart of a part-of-speech tagging method according to the present invention.
A schematic diagram of a part-of-speech tagging system according to a second embodiment of the present invention.
A flowchart of a part-of-speech tagging method according to the second embodiment of the present invention.
A schematic diagram of a part-of-speech tagging system according to a third embodiment of the present invention.
A flowchart of a part-of-speech tagging method according to the third embodiment of the present invention.

次に、図を参照して、本発明の好適な実施例について説明する。なお、同じ参照記号または番号が異なる図で使用されている場合は、同一もしくは類似の構成要素であることを示す。以下では、本発明の主題が曖昧となるのを避けるため、既知の機能および構成の詳細な説明は省略している。 Next, preferred embodiments of the present invention will be described with reference to the drawings. In addition, when the same reference symbol or number is used in different drawings, it indicates the same or similar component. In the following, detailed descriptions of known functions and configurations are omitted to avoid obscuring the subject matter of the present invention.

図１は、本発明の第１の実施例による品詞タグ付けシステムの概略図である。品詞タグ付けシステム１において、品詞タグトレーニング集合１０は、多数のタグ付きテキスト（すなわち、タグ付きテキスト集合）から成る。品詞階層ツリー構築装置１４は、品詞タグトレーニング集合１０内のタグ付きテキストに基づいて、異なる品詞間の関連性を分析し、分析された関連性に基づいて品詞階層ツリー１５を構築して、品詞タグトレーニング集合１０内に存在するタグ付き品詞を階層状に編成する。ここで、関連性は例えば品詞間の類似性としてもよい。品詞タグ付けモデルトレーニング装置１２は、トレーニングを行って品詞タグ付けモデル１３を生成する。品詞タグ付けモデルトレーニング装置１２は品詞タグトレーニング集合１０からタグ付きテキストを読み取り、品詞階層ツリー１５内の品詞階層に関する情報に基づいて、品詞タグ付け用ＣＲＦ品詞タグ付けモデル１３をトレーニングするためのモデルトレーニング処理を構築する。生成された品詞タグ付けモデルは、カスケード化された品詞タグ付けモデルである。品詞タグ付け装置２２は、生成された品詞タグ付けモデルに基づいて、任意の非タグ付きテキスト内の語に対して品詞タグ付けを実行する。 FIG. 1 is a schematic diagram of a part of speech tagging system according to a first embodiment of the present invention. In the part-of-speech tagging system 1, the part-of-speech tag training set 10 is composed of a large number of tagged texts (ie, tagged text sets). The part-of-speech hierarchy tree construction device 14 analyzes the relation between different parts of speech based on the tagged text in the part-of-speech tag training set 10, and constructs the part-of-speech hierarchy tree 15 based on the analyzed relation. Tagged part-of-speech that exists in the tag training set 10 is organized in a hierarchy. Here, the relationship may be a similarity between parts of speech, for example. The part-of-speech tagging model training device 12 performs training to generate a part-of-speech tagging model 13. The part-of-speech tagging model training device 12 reads the tagged text from the part-of-speech tag training set 10, and trains the CRF part-of-speech tagging model 13 for part-of-speech tagging based on information about the part-of-speech hierarchy in the part-of-speech hierarchy tree 15. Build a training process. The generated part of speech tagging model is a cascaded part of speech tagging model. The part-of-speech tagging device 22 performs part-of-speech tagging for words in any untagged text based on the generated part-of-speech tagging model.

Although FIG. 1 shows a part-of-speech tagging system that includes the part-of-speech hierarchy tree construction device 14, a part-of-speech tagging system without this device is also possible. In that case, untagged text is tagged using a pre-built part-of-speech hierarchy tree, which may be, for example, a manually constructed tree. Furthermore, the part-of-speech tagging system may include only the part-of-speech tagging model training device 12, which generates the part-of-speech tagging model 13 for part-of-speech tagging.

The part-of-speech hierarchy tree 15 organizes parts of speech hierarchically as a tree structure. FIG. 4a shows an example of a part-of-speech hierarchy tree. This tree has four levels (levels 0, 1, 2, and 3), and levels 2 and 3 each have six nodes. The leaf nodes of the tree correspond to actual parts of speech, and the other nodes are arbitrarily defined dummy class names. FIGS. 4b and 4c show examples of the data structure of the part-of-speech hierarchy tree shown in FIG. 4a.

図１ｂは、品詞タグ付け方法のフローチャートである。Ｓ１０１において、品詞階層ツリー構築装置１４は品詞階層ツリー１５を構築して、品詞タグトレーニング集合内に存在するタグ付き品詞を階層状に編成する。Ｓ１０２において、品詞タグ付けモデルトレーニング装置１２は品詞タグトレーニング集合１０からタグ付きテキストを読み取り、品詞階層ツリー１５内の品詞階層に関する情報に基づいて、品詞タグ付け用の品詞タグ付けモデル１３を生成する。この品詞タグ付けモデルは、カスケード化された品詞タグ付けモデルである。Ｓ１０３において、品詞タグ付け装置２２は、生成された品詞タグ付けモデル１３を使用して、入力されたテキストに対し品詞タグ付けを実行する。 FIG. 1b is a flowchart of the part-of-speech tagging method. In S101, the part-of-speech hierarchy tree construction device 14 constructs the part-of-speech hierarchy tree 15 and organizes the tagged parts of speech existing in the part-of-speech tag training set into a hierarchy. In S102, the part-of-speech tagging model training device 12 reads the tagged text from the part-of-speech tag training set 10, and generates a part-of-speech tagging model 13 for part-of-speech tagging based on information about the part-of-speech hierarchy in the part-of-speech hierarchy tree 15. . This part of speech tagging model is a cascaded part of speech tagging model. In S103, the part-of-speech tagging device 22 performs part-of-speech tagging on the input text using the generated part-of-speech tagging model 13.

Next, the construction of the part-of-speech hierarchy tree 15 will be described with reference to FIGS. 2 and 3.

FIG. 2 is a schematic diagram of the part-of-speech hierarchy tree construction device 14 of the present invention. In this figure, the part-of-speech feature template selection unit 140 selects a part-of-speech feature template that expresses the grammatical behavior of a part of speech. The grammatical behavior of a part of speech can be expressed in various ways. One example of a feature set that can be selected as the template consists of the word immediately before the current word in the tagged text, the part of speech of that preceding word, the word immediately after the current word, and the part of speech of that following word. The feature vector construction unit 141 constructs a feature vector for each part of speech in the part-of-speech tag training set 10 based on the selected template. The similarity calculation unit 142 uses the constructed feature vectors to calculate the similarity between any two parts of speech in the part-of-speech tag training set 10. The clustering unit 143 clusters all parts of speech in the part-of-speech tag training set 10 based on the calculated similarities, using a conventional hierarchical clustering algorithm, and generates the part-of-speech hierarchy tree 15 according to preset rules.

FIG. 3 is a flowchart of the process by which the part-of-speech hierarchy tree construction device generates a part-of-speech hierarchy tree. In S301, the part-of-speech feature template selection unit 140 selects a group of part-of-speech features as the part-of-speech feature template, for example the word immediately before the current word in the tagged text, the part of speech of that preceding word, the word after the current word, and the part of speech of that following word. For example, if the word “評出” is selected as the current word in the tagged text “香港/ns 評出/v 十/m 大/a 傑出/a 青年/n”, the part of speech of the current word is “v”. The part-of-speech feature group in this case is expressed as follows.

In S302, the feature vector construction unit 141 constructs a feature vector for each part of speech in the part-of-speech tag training set 10 based on the part-of-speech feature template. Suppose, for example, that the training set contains a total of dz words and lz parts of speech. Given the feature group selected above, the unit 141 can construct the following vectors for any part of speech x.
1) x<previous word>, the preceding-word vector: this vector has dz dimensions, and each element is the frequency with which the corresponding word appears immediately before a word of part of speech x.
2) x<previous word's POS>, the preceding-word part-of-speech vector: this vector has lz dimensions, and each element is the frequency with which the corresponding part of speech appears immediately before a word of part of speech x.
3) x<next word>, the following-word vector: this vector has dz dimensions, and each element is the frequency with which the corresponding word appears immediately after a word of part of speech x.
4) x<next word's POS>, the following-word part-of-speech vector: this vector has lz dimensions, and each element is the frequency with which the corresponding part of speech appears immediately after a word of part of speech x.
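The four vectors can be sketched as context counts gathered in one pass over the tagged corpus. This is an illustrative sketch, not the publication's implementation: it stores each vector sparsely as a `Counter` instead of a dense vector of length dz or lz, and the function name `context_counts` and the dictionary keys are assumptions.

```python
from collections import Counter, defaultdict


def context_counts(corpus):
    """For every POS tag x, count the words and tags adjacent to words tagged x."""
    feats = defaultdict(lambda: {"prev_word": Counter(), "prev_pos": Counter(),
                                 "next_word": Counter(), "next_pos": Counter()})
    for sentence in corpus:
        for i, (_, tag) in enumerate(sentence):
            if i > 0:  # left neighbour exists
                word, pos = sentence[i - 1]
                feats[tag]["prev_word"][word] += 1
                feats[tag]["prev_pos"][pos] += 1
            if i + 1 < len(sentence):  # right neighbour exists
                word, pos = sentence[i + 1]
                feats[tag]["next_word"][word] += 1
                feats[tag]["next_pos"][pos] += 1
    return feats


# The tagged example sentence from the text, as (word, tag) pairs.
corpus = [[("香港", "ns"), ("評出", "v"), ("十", "m"),
           ("大", "a"), ("傑出", "a"), ("青年", "n")]]
feats = context_counts(corpus)
```

A sparse representation is used because most words never occur next to a given part of speech, so a dense dz-dimensional vector would be almost entirely zeros.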

In S303, the similarity calculation unit 142 calculates the similarity between any two parts of speech in the part-of-speech tag training set 10, for example x1 and x2, by performing the following steps.
1) First, the following similarities are calculated between the corresponding pairs of feature vectors of the two parts of speech (x1, x2).
Simc(x1<previous word>, x2<previous word>)
Simc(x1<previous word's POS>, x2<previous word's POS>)
Simc(x1<next word>, x2<next word>)
Simc(x1<next word's POS>, x2<next word's POS>)
2) The overall similarity is then calculated using the following formula:
Sim(x1, x2) = w1 * Simc(x1<previous word>, x2<previous word>) +
w2 * Simc(x1<previous word's POS>, x2<previous word's POS>) +
w3 * Simc(x1<next word>, x2<next word>) +
w4 * Simc(x1<next word's POS>, x2<next word's POS>),
where w1 + w2 + w3 + w4 = 1.
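The weighted combination can be sketched as follows. The text does not define Simc, so cosine similarity is assumed here purely for illustration; the equal weights, the dictionary keys, and the toy feature vectors are also assumptions.

```python
import math

KEYS = ("prev_word", "prev_pos", "next_word", "next_pos")


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed Simc)."""
    dot = sum(value * v.get(key, 0) for key, value in u.items())
    norm = (math.sqrt(sum(a * a for a in u.values()))
            * math.sqrt(sum(b * b for b in v.values())))
    return dot / norm if norm else 0.0


def sim(x1, x2, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four per-feature similarities; weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * cosine(x1[k], x2[k]) for w, k in zip(weights, KEYS))


# Toy context counts for three hypothetical parts of speech.
noun = {"prev_word": {"the": 3}, "prev_pos": {"DT": 3},
        "next_word": {"is": 2}, "next_pos": {"VB": 2}}
name = {"prev_word": {"the": 1}, "prev_pos": {"DT": 1},
        "next_word": {"is": 1}, "next_pos": {"VB": 1}}
pron = {"prev_word": {"and": 1}, "prev_pos": {"CC": 1},
        "next_word": {"is": 1}, "next_pos": {"VB": 1}}

same_context = sim(noun, name)   # identical context distributions
half_context = sim(noun, pron)   # only the right-hand context matches
```

Cosine similarity ignores the overall frequency of a part of speech, so a rare tag and a common tag with the same context distribution still count as similar.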

In S304, the clustering unit 143 clusters all parts of speech based on the calculated similarities, using a hierarchical clustering algorithm (for example, the K-means algorithm), and generates the hierarchy tree according to preset rules. In the present invention, the preset rules may include a definition such as “the number of nodes in each level is less than n (n is a positive integer)”. In this case, n can be set to 8, for example.
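One way to satisfy a rule of the form “fewer than n nodes per level” is sketched below. Note the substitution: this sketch uses greedy agglomerative merging with average linkage, not the K-means algorithm named in the text, and it produces the grouping for a single level only. The function name `merge_level` and the similarity values are hypothetical.

```python
def merge_level(tags, sim, n):
    """Greedily merge the two most similar clusters until fewer than n remain."""
    clusters = [[t] for t in tags]

    def avg_sim(a, b):
        # Average pairwise similarity between two clusters (average linkage).
        return sum(sim[x][y] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) >= n and len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        # j > i, so removing cluster j leaves index i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters


# Hypothetical symmetric similarities between four parts of speech.
pairwise = {("n", "ns"): 0.9, ("n", "v"): 0.1, ("n", "a"): 0.3,
            ("ns", "v"): 0.2, ("ns", "a"): 0.2, ("v", "a"): 0.4}
tags = ["n", "ns", "v", "a"]
sim = {t: {t: 1.0} for t in tags}
for (a, b), s in pairwise.items():
    sim[a][b] = sim[b][a] = s

groups = merge_level(tags, sim, 3)
```

Applying the same step recursively inside each resulting group would yield deeper levels, with each group becoming a dummy internal node of the tree.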

以下では、図５ａと図５ｂを参照して、品詞タグ付けモデルの生成について説明する。図５ａは、本発明による品詞タグ付けモデルトレーニング装置１２のブロック図である。品詞タグ付けモデルトレーニング装置１２は、ＣＲＦモデルトレーニングコーパス構築ユニット１２１と、ＣＲＦモデルトレーニングユニット１２２と、論理回路１２０とを備える。ＣＲＦモデルトレーニングコーパス構築ユニット１２１は、品詞階層ツリー１５に基づいて、品詞タグトレーニング集合１０から階層毎およびノード毎に読み取られたトレーニングテキストに対して品詞タグ付けを実行する。ＣＲＦモデルトレーニングユニット１２２は、ＣＲＦモデルトレーニングコーパス構築ユニット１２１によってタグ付けされたトレーニングテキストに基づいて、対応する階層毎およびノード毎にＣＲＦモデルをトレーニングする。論理回路１２０は、品詞タグ付けモデルのトレーニング処理において、ＣＲＦモデルトレーニングコーパス構築ユニット１２１とＣＲＦモデルトレーニングユニットとを制御する。論理回路１２０は、品詞階層ツリーの階層数を保持しており、ＣＲＦモデルトレーニングコーパス構築ユニット１２１とＣＲＦモデルトレーニングユニットが１つの階層を処理する毎に階層数を増分し、品詞階層ツリーの最後の階層のすべてのノードが処理されるまでこれを継続する。 In the following, the generation of a part-of-speech tagging model will be described with reference to FIGS. 5a and 5b. FIG. 5a is a block diagram of the part-of-speech tagging model training device 12 according to the present invention. The part-of-speech tagging model training apparatus 12 includes a CRF model training corpus construction unit 121, a CRF model training unit 122, and a logic circuit 120. The CRF model training corpus construction unit 121 performs part-of-speech tagging on the training text read from the part-of-speech tag training set 10 for each hierarchy and each node based on the part-of-speech hierarchy tree 15. The CRF model training unit 122 trains the CRF model for each corresponding hierarchy and node based on the training text tagged by the CRF model training corpus construction unit 121. The logic circuit 120 controls the CRF model training corpus construction unit 121 and the CRF model training unit in the part-of-speech tagging model training process. The logic circuit 120 holds the number of layers of the part-of-speech hierarchy tree, and increments the number of layers every time the CRF model training corpus construction unit 121 and the CRF model training unit process one hierarchy, This continues until all nodes in the hierarchy have been processed.

FIG. 5b is a flowchart of the process by which the part-of-speech tagging model training device generates the part-of-speech tagging model. The method is a nested training method comprising a two-level loop, and it adopts a top-down training mode: the training result of one level affects the next level, whereas the models within the same level can be trained independently of one another. Assume that the part-of-speech hierarchy tree consists of n levels, that level i has m_i nodes, and that the current node is denoted j. First, the logic circuit 120 assigns the value 0 to the level i in S601 and the value 1 to the node j in S602. Next, in S603, the CRF model training corpus construction unit 121 constructs the training corpus for the <i, j> CRF model by replacing each part-of-speech label in the tagged text of the original part-of-speech tag training set 10 with the name of the subnode of the current node that corresponds to that label in the part-of-speech hierarchy tree. In S604, the CRF model training unit 122 trains the <i, j> CRF model using the <i, j> CRF model training corpus and the selected feature templates. When i = 0, the feature templates selected by the CRF model training unit 122 include the two neighboring words (the word before and the word after the current word), the characters before and after the current word, and the co-occurrence frequency of the two neighboring words. When i > 0, in addition to the feature templates used at level 0, feature templates are also used that include the parts of speech of the two neighboring words as given by the tagging result of the immediately preceding level, the co-occurrence between parts of speech, and the co-occurrence between words and parts of speech. In S605, the value of j is incremented, and in S606 it is determined whether j is greater than m_i. If j is not greater than m_i, the processing returns to S603. If j is greater than m_i, the value of i is incremented in S607 and the processing returns to S602; this continues until the nodes of all levels of the part-of-speech hierarchy tree have completed S603 and S604. In this way, a cascaded part-of-speech tagging model can be trained even for a large tag set.
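The nested loop of FIG. 5b can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the device: `build_corpus` and `train_crf` are hypothetical callbacks standing in for units 121 and 122, and `nodes_per_level` (the list of m_i values) is an assumed input.

```python
def train_cascade(nodes_per_level, build_corpus, train_crf):
    """Train one model per <level, node> pair, top-down.

    nodes_per_level[i] is m_i, the number of nodes at level i (assumed input).
    build_corpus(i, j) and train_crf(corpus, use_prev_level) are hypothetical
    callbacks standing in for units 121 and 122.
    """
    models = {}
    i = 0                                   # S601: start at level 0
    while i < len(nodes_per_level):
        j = 1                               # S602: start at node 1
        while j <= nodes_per_level[i]:      # S606: stop once j exceeds m_i
            corpus = build_corpus(i, j)     # S603: re-tagged corpus for <i, j>
            # S604: levels above 0 also use features from the previous level
            models[(i, j)] = train_crf(corpus, use_prev_level=(i > 0))
            j += 1                          # S605
        i += 1                              # S607
    return models


# Usage with stand-in callbacks that only record what they were given.
models = train_cascade(
    [1, 4],
    build_corpus=lambda i, j: f"corpus<{i},{j}>",
    train_crf=lambda corpus, use_prev_level: (corpus, use_prev_level),
)
```

Because the models of one level do not depend on each other, the inner loop could be run in parallel; the sketch keeps it sequential to mirror the flowchart.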

Take the following fully tagged sentence as an example:
香港/ns 評出/v 十/m 大/a 傑出/a 青年/n
(Hong Kong / ns, selects / v, ten / m, great / a, outstanding / a, youth / n)
At level 0, the <0, 1> CRF model training corpus is constructed. First, the sentence is re-tagged. Referring to the part-of-speech hierarchy tree shown in FIG. 4a, the subnodes of node 1 at level 0 are "label1", "label2", "label3", and "label4". In FIG. 4a, the true part of speech "v" corresponds to the subnode "label1" in the first level of the part-of-speech hierarchy tree. Accordingly, every word tagged "v" in the original training set is re-tagged as "label1".

The sentence re-tagged at level 0 is as follows:
香港/label3 評出/label1 十/label2 大/label1 傑出/label1 青年/label3
The CRF model is then trained at level 0. The selected feature templates include the two neighboring words (such as 香港 and 評出), the characters before and after the current word, and the co-occurrence of the two neighboring words (here, "co-occurrence" means that the two words appear together in a given context).

The sentence is then re-tagged again at level 1. For the first node <1, 1> of level 1, the <1, 1> CRF model training corpus is constructed. Referring to the part-of-speech hierarchy tree of FIG. 4a, node <1, 1> has the subnodes "label11" and "label12". Accordingly, each word tagged "label1" at level 0 is further tagged with one of "label11" and "label12" (that is, with a member of the set of subnode names of the current node).

The level-0 tagging result "香港/label3 評出/label1 十/label2 大/label1 傑出/label1 青年/label3" becomes, after re-tagging for node <1, 1>, "香港/label3 評出/label12 十/label2 大/label11 傑出/label11 青年/label3".
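The re-tagging performed in S603 can be illustrated with a short sketch. The `TREE` below is a hypothetical fragment reconstructed from the labels that appear in the worked examples; the actual tree of FIG. 4a is not reproduced here and may differ. The function names are likewise illustrative.

```python
# Hypothetical hierarchy: top-level label -> second-level label -> original tags.
TREE = {
    "label1": {"label11": ["a"], "label12": ["v"]},
    "label2": {"label21": ["m"]},
    "label3": {"label31": ["n"], "label32": ["ns"]},
}


def path_of(tag):
    """Return [level-0 label, level-1 label, original tag] for a tag."""
    for top, children in TREE.items():
        for child, tags in children.items():
            if tag in tags:
                return [top, child, tag]
    raise KeyError(tag)


def retag(sentence, level, node=None):
    """Re-tag (word, tag) pairs for the model of `node` at `level`.

    At level 0 every tag is replaced by its top-level label. At deeper
    levels, words under `node` receive the name of the matching subnode,
    and all other words keep the label of the previous level.
    """
    out = []
    for word, tag in sentence:
        path = path_of(tag)
        if level == 0:
            out.append((word, path[0]))
        elif path[level - 1] == node:
            out.append((word, path[level]))
        else:
            out.append((word, path[level - 1]))
    return out


sentence = [("香港", "ns"), ("評出", "v"), ("十", "m"),
            ("大", "a"), ("傑出", "a"), ("青年", "n")]
level0 = retag(sentence, 0)
level1_node1 = retag(sentence, 1, node="label1")
```

With this assumed tree, `level0` and `level1_node1` reproduce the two re-tagged sentences quoted above.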

CRF model training is then performed for node <1, 1>. In addition to the level-0 feature templates described above, the selected feature templates include the parts of speech of the two neighboring words contained in the tagging result of the immediately preceding level, the co-occurrence between parts of speech, and the co-occurrence between words and parts of speech. For the word 評出, for example, the feature templates include "label3" and "label2", which are the parts of speech of the two neighboring words 香港 and 十, the co-occurrence between these parts of speech, and the co-occurrence between the words and the parts of speech.
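A rough sketch of how such features might be assembled for one token is given below. The feature names (`w-1`, `p+1`, and so on), the padding symbol, and the representation of "co-occurrence" as a simple conjunction of two values are all assumptions made for illustration; the text does not specify the encoding.

```python
PAD = "<pad>"  # assumed marker for positions outside the sentence


def at(seq, k):
    """Element k of seq, or PAD when k is out of range."""
    return seq[k] if 0 <= k < len(seq) else PAD


def token_features(words, t, prev_labels=None):
    """Features for words[t]; prev_labels is the previous level's tagging."""
    w_prev, w_next = at(words, t - 1), at(words, t + 1)
    feats = {
        "w-1": w_prev,                                   # preceding word
        "w+1": w_next,                                   # following word
        "c-1": w_prev[-1] if w_prev != PAD else PAD,     # character before
        "c+1": w_next[0] if w_next != PAD else PAD,      # character after
        "w-1|w+1": f"{w_prev}|{w_next}",                 # word co-occurrence
    }
    if prev_labels is not None:                          # levels above 0 only
        p_prev, p_next = at(prev_labels, t - 1), at(prev_labels, t + 1)
        feats.update({
            "p-1": p_prev,
            "p+1": p_next,
            "p-1|p+1": f"{p_prev}|{p_next}",             # POS co-occurrence
            "w0|p-1": f"{words[t]}|{p_prev}",            # word/POS co-occurrence
        })
    return feats


words = ["香港", "評出", "十", "大", "傑出", "青年"]
level0 = ["label3", "label1", "label2", "label1", "label1", "label3"]
feats = token_features(words, 1, prev_labels=level0)
```

For 評出 (index 1), `feats["p-1"]` is `label3` and `feats["p+1"]` is `label2`, matching the example in the text.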

Likewise, the CRF model training corpus construction process and the CRF model training process described above are performed for each of the nodes <1, 2>, <1, 3>, and <1, 4>, and this continues until the nodes of all levels have completed both processes.

FIG. 6a is a block diagram of the part-of-speech tagging device. The part-of-speech tagging device 22 comprises a logic circuit 222, a CRF model feature construction unit 220, and a CRF part-of-speech tagging unit 221. During the part-of-speech tagging process, the logic circuit 222 controls the CRF model feature construction unit 220 and the CRF part-of-speech tagging unit 221 in accordance with the cascaded part-of-speech tagging model. Under the control of the logic circuit 222, the CRF model feature construction unit 220 constructs, level by level and node by node, the feature data needed to apply the <i, j> CRF model to the text to be tagged. Under the control of the logic circuit 222, the CRF part-of-speech tagging unit 221 performs part-of-speech tagging for each corresponding level and node on the basis of the feature data constructed by the CRF model feature construction unit 220.

FIG. 6b is a flowchart of the cascaded CRF part-of-speech tagging method performed by the part-of-speech tagging device. Assume that the part-of-speech tagging model has n levels in total, that level i has m_i nodes, and that the current node is denoted j. First, the logic circuit 222 assigns the value 0 to the level i in S901 and the value 1 to the node j in S902. Next, in S903, the CRF model feature construction unit 220 constructs the feature data needed to apply the <i, j> CRF model. The CRF model feature construction unit 220 constructs the input feature data for the CRF model on the basis of the feature template set used in training the part-of-speech tagging model. One of the following two methods is used, depending on the level i.
1) When i is 0, the feature templates for the CRF model are filled in directly. That is, the relevant feature information is extracted directly from the text input for tagging and inserted into the templates, thereby generating the input feature data of the CRF model.
2) When i is not 0, in addition to the feature information extracted as at level 0, feature information is extracted from the result of tagging the text with the CRF models of level i-1, thereby generating the input feature data of the CRF model.

In S904, the text is tagged on the basis of the generated feature data, using the <i, j> CRF model of the part-of-speech tagging model 13.

In S905, the value of j is incremented, and in S906 it is determined whether j is greater than m_i. If j is not greater than m_i, the processing returns to S903. If j is greater than m_i, the value of i is incremented in S907 and the processing proceeds to S908 and then S902; this continues until the nodes of all levels of the part-of-speech hierarchy tree have completed S903 and S904. By tagging the text level by level in this way, part-of-speech tagging with a large tag set is achieved.

A simple example is described below to give a clearer picture of the tagging process as a whole.

Suppose that the text to be tagged is "北京入囲十大宜居城市" (roughly, "Beijing is shortlisted among the ten most livable cities").
Level 0 (using the <0, 1> CRF model)
The tagging result is "北京/label3 入囲/label1 十/label2 大/label1 宜居/label1 城市/label3".
Level 1 (using all CRF models of this level)
1. Using the <1, 1> CRF model, the result "北京/label3 入囲/label12 十/label2 大/label11 宜居/label11 城市/label3" is obtained.
2. The <1, 2> CRF model is applied in the same way.
......
The tagging result after level 1 has been processed is "北京/label32 入囲/label12 十/label21 大/label11 宜居/label11 城市/label31".
Level 2
1. Using the <2, 1> CRF model, the result "北京/label32 入囲/label12 十/label21 大/a 宜居/a 城市/label31" is obtained.
2. The <2, 2> CRF model is applied in the same way.
The complete tagging result finally obtained is "北京/ns 入囲/v 十/m 大/a 宜居/a 城市/n".
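The level-by-level refinement in this example can be mimicked with a small sketch. Real CRF models are replaced here by hypothetical lookup tables (`lookup_model`), so the code only demonstrates the control flow of the cascade, not statistical tagging; the assignment of labels to nodes is an assumption taken from the example strings.

```python
def lookup_model(node, table):
    """Stand-in for an <i, j> CRF model: refines words labelled `node`."""
    def model(words, labels):
        if labels is None:                       # level 0: tag every word
            return [table[w] for w in words]
        return [table[w] if lab == node else lab
                for w, lab in zip(words, labels)]
    return model


def tag_cascade(words, models_by_level):
    """Apply every model of every level in order (outer loop i, inner loop j)."""
    labels = None
    for level_models in models_by_level:
        for model in level_models:
            labels = model(words, labels)
    return labels


words = ["北京", "入囲", "十", "大", "宜居", "城市"]
cascade = [
    [lookup_model(None, {"北京": "label3", "入囲": "label1", "十": "label2",
                         "大": "label1", "宜居": "label1", "城市": "label3"})],
    [lookup_model("label1", {"入囲": "label12", "大": "label11", "宜居": "label11"}),
     lookup_model("label2", {"十": "label21"}),
     lookup_model("label3", {"北京": "label32", "城市": "label31"})],
    [lookup_model("label11", {"大": "a", "宜居": "a"}),
     lookup_model("label12", {"入囲": "v"}),
     lookup_model("label21", {"十": "m"}),
     lookup_model("label31", {"城市": "n"}),
     lookup_model("label32", {"北京": "ns"})],
]
final_tags = tag_cascade(words, cascade)
after_level1 = tag_cascade(words, cascade[:2])
```

Each model touches only the words that the previous level assigned to its node, which is why the tag set handled by any single model stays small.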

FIG. 7a is a schematic block diagram of a part-of-speech tagging system according to a second embodiment of the present invention. In addition to the components of the part-of-speech tagging system of FIG. 1a, this part-of-speech tagging system further comprises an evaluation device 16, an adjustment device 17, and a test set construction device 18. The test set construction device 18 randomly selects a set of part-of-speech-tagged text from the part-of-speech tag training set 10 to serve as a test set, that is, as the set of text to be tagged. The evaluation device 16 evaluates the tagging result obtained after the test set has been tagged using the part-of-speech tagging model. In this case, the evaluation device 16 evaluates the tagging accuracy on the basis of the result of this trial. The adjustment device 17 adjusts the part-of-speech hierarchy tree construction device 14 on the basis of the evaluation result of the evaluation device 16 so that a better-performing part-of-speech hierarchy tree can be constructed.

FIG. 7b is a flowchart of the part-of-speech tagging process performed by this part-of-speech tagging system. As shown in FIG. 7b, in S701 the test set construction device 18 randomly extracts a subset of the part-of-speech tag training set 10 as a test set. In S702, the part-of-speech tagging system performs part-of-speech tagging on the test set using the trained part-of-speech tagging model 13. In S703, the evaluation device 16 evaluates the accuracy of the tagged test set and passes the evaluation result to the adjustment device 17. Then, in S704, the adjustment device 17 judges the performance of the part-of-speech tagging model on the basis of the evaluation result. If the performance does not satisfy a predetermined condition, then in S705 the adjustment device 17 adjusts the thresholds w1, w2, w3, and w4 used by the part-of-speech hierarchy tree construction device 14 in order to change the clustering result. In S706, the adjustment device 17 adjusts the clustering result using heuristic rules, for example a rule stating that "n and ns are to be placed in different clusters".
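The evaluation steps S701 to S704 amount to sampling a test set, measuring token-level accuracy, and comparing it with a required level. The sketch below illustrates this under assumptions: the function names, the fixed random seed, and the `required` threshold are hypothetical, and the text does not state how accuracy is computed or which condition must be met.

```python
import random


def sample_test_set(training_set, k, seed=0):
    """S701: draw k tagged sentences at random from the training set."""
    return random.Random(seed).sample(training_set, k)


def accuracy(gold, predicted):
    """S703: fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def needs_adjustment(acc, required):
    """S704: True when the model falls short of the required accuracy."""
    return acc < required


gold = ["ns", "v", "m", "a", "a", "n"]
pred = ["ns", "v", "m", "a", "v", "n"]
acc = accuracy(gold, pred)
```

When `needs_adjustment` returns True, the adjustment of the thresholds w1 to w4 (S705) and the heuristic rules (S706) would be applied and the tree rebuilt; that part is outside this sketch.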

FIG. 8a is a block diagram of a part-of-speech tagging system according to a third embodiment of the present invention. Unknown words are tagged with comparatively low accuracy because the training corpus normally contains no training data for them, and this lowers the overall tagging accuracy. The part-of-speech tagging system of the present invention can correct the parts of speech of unknown words and thereby improves the overall tagging accuracy of the system. In addition to the components of the part-of-speech tagging system of FIG. 1a, this part-of-speech tagging system further comprises an unknown word part-of-speech estimation model construction device 19 and an unknown word part-of-speech correction device 21. The unknown word part-of-speech estimation model construction device 19 learns word construction rules from the existing part-of-speech tag training set 10 and constructs an unknown word part-of-speech estimation model 20 in accordance with the learned rules. The unknown word part-of-speech correction device 21 uses the unknown word part-of-speech estimation model 20 to correct the parts of speech of unknown words in text tagged with the part-of-speech tagging model 13.

FIG. 8b shows a part-of-speech tagging method according to the third embodiment of the present invention. As shown in FIG. 8b, in S801 the unknown word part-of-speech estimation model construction device 19 first segments the words in the part-of-speech tag training set into direct constituents and analyzes the attributes of the resulting direct constituents (that is, it identifies the direct constituents of each word in the part-of-speech tag training set and tags their attributes), thereby obtaining a sequence of word components.

The definition of a direct constituent is briefly explained here. A smaller unit that makes up a larger unit is called a constituent of the larger unit; accordingly, a smaller unit that directly makes up a larger unit is called a "direct constituent". Each entry in the part-of-speech tag training set is a word itself, not a component smaller than a word. The analysis of direct constituents and their attributes referred to here therefore differs from ordinary word segmentation and part-of-speech tagging. It means dividing each word of two or more characters in the part-of-speech tag training set into the units immediately below it. For a word consisting of two characters, for example, the units immediately below it are the individual characters (morphemes) that make up the word. A word consisting of three or more characters is divided into a word that exists in the dictionary (found by maximum matching) and a morpheme. Take the word 科学技術部 ("Department of Science and Technology"): if 科学 and 技術 exist in the dictionary but 科学技術 and 技術部 do not, the word is divided into 科学/技術/部. If 科学, 技術部, and 技術 all exist in the dictionary, the result is 科学/技術部. A direct constituent can therefore be either a word or a morpheme. The attribute of a direct constituent is mainly its syntactic attribute, expressed in the form of a part-of-speech tag. The attribute of a direct constituent may be any of the possible part-of-speech tags.
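The segmentation rule just described can be sketched with forward maximum matching. This is one plausible reading of the rule, not a confirmed implementation: the function name is hypothetical, and excluding the whole word itself from the dictionary lookup is an assumption made so that a known word is still split into smaller units.

```python
def split_constituents(word, dictionary):
    """Split a word into direct constituents.

    Words of one or two characters are split into single characters.
    Longer words are scanned left to right; at each position the longest
    dictionary entry is taken (the whole word itself is excluded), and a
    single character is used when no entry matches.
    """
    if len(word) <= 2:
        return list(word)
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):    # try the longest piece first
            piece = word[pos:end]
            if piece != word and (end - pos == 1 or piece in dictionary):
                parts.append(piece)
                pos = end
                break
    return parts


first = split_constituents("科学技術部", {"科学", "技術"})
second = split_constituents("科学技術部", {"科学", "技術部", "技術"})
```

The two calls correspond to the two dictionary scenarios in the text: the first yields 科学/技術/部 and the second yields 科学/技術部.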

Table 3 shows the results of direct constituent segmentation and attribute analysis for the two words 冷暴力 ("cold violence") and 掃射 ("strafing").

The sequences obtained from the above are as follows.
冷暴力 → 冷 2 a N_B 暴力 4 n N_E
掃射 → 掃 2 v V_B 射 2 v V_E
If 冷射 is an unknown word, the word-component sequence obtained for it is "冷 2 a 射 2 v".
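A small sketch of how such a word-component sequence might be assembled is shown below. Several points are assumptions: the length column is taken to be two units per character (which reproduces the 2 and 4 in the example), the suffixes `_B` and `_E` are read as marking the first and last constituent of a word with a known part-of-speech class, and the `_M` suffix for middle constituents is purely hypothetical since the text shows no three-constituent example.

```python
def component_sequence(constituents, word_pos=None):
    """Build rows [text, length, attribute(, position tag)] for one word.

    constituents: list of (text, attribute) pairs for the direct constituents.
    word_pos: part-of-speech class of the whole word (e.g. "N"), or None
    for an unknown word, in which case no position tag is attached.
    """
    rows = []
    last = len(constituents) - 1
    for k, (text, attr) in enumerate(constituents):
        row = [text, str(2 * len(text)), attr]   # assumed: 2 units per character
        if word_pos is not None:
            if k == 0:
                suffix = "B"                     # first constituent
            elif k == last:
                suffix = "E"                     # last constituent
            else:
                suffix = "M"                     # hypothetical middle marker
            row.append(f"{word_pos}_{suffix}")
        rows.append(row)
    return rows


known = component_sequence([("冷", "a"), ("暴力", "n")], word_pos="N")
unknown = component_sequence([("冷", "a"), ("射", "v")])
```

Here `known` corresponds to the sequence for 冷暴力 and `unknown` to the untagged sequence for 冷射.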

In S802, the unknown word part-of-speech estimation model construction device 19 selects part-of-speech feature templates.

In S803, the unknown word part-of-speech estimation model construction device 19 converts the obtained word-component sequences using the selected part-of-speech feature templates and generates the unknown word estimation model 20 using any known machine learning algorithm. With the unknown word estimation model 20, for example, the following can be obtained as the part of speech of 冷射:
POS(冷 2 a V_B, 射 2 v V_E) = V.

In S804, the part-of-speech tagging system uses the generated unknown word estimation model 20 to re-tag the unknown words in the text tagged with the part-of-speech tagging model 13.

Suppose that the word-component sequence is "掃 2 v V_B 射 2 v V_E" and that the following feature templates are selected.
// Part of speech of the constituent
U01:%x[-1,2] // second feature of the previous constituent (/) ("/" denotes a null feature)
U02:%x[0,2] // second feature of the current constituent (a)
// Length of the constituent
U03:%x[1,1] // first feature of the next constituent (2, 2)
// Constituent
U04:%x[0,0] // zeroth feature of the current constituent
The word-component sequence is converted into the following input data for any machine learning method, such as CRF.
if (T(-1,2) = '/') tag = 'V_B'
if (T(0,2) = 'v') tag = 'V_B'
if (T(1,1) = '2') tag = 'V_B'
if (T(0,0) = '掃') tag = 'V_B'

if (T(-1,2) = 'v') tag = 'V_E'
if (T(0,2) = 'v') tag = 'V_E'
if (T(1,1) = '2') tag = 'V_E'
if (T(0,0) = '射') tag = 'V_E'
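The `%x[row,col]` notation resembles the macro syntax of CRF++ feature templates, where `row` is an offset relative to the current position and `col` is a column index. The following sketch expands such macros over the constituent rows. It is an illustration only: the `expand` helper is hypothetical, and using "/" for positions outside the sequence follows the null-feature convention mentioned in the text rather than any particular toolkit.

```python
import re

# Matches macros such as %x[-1,2]: relative row offset, then column index.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, t, null="/"):
    """Expand every %x[row,col] macro in `template` at position t."""
    def substitute(match):
        r = t + int(match.group(1))
        c = int(match.group(2))
        return rows[r][c] if 0 <= r < len(rows) else null
    return MACRO.sub(substitute, template)


# Columns: 0 = constituent, 1 = length, 2 = part of speech.
rows = [["掃", "2", "v"], ["射", "2", "v"]]
templates = ["U01:%x[-1,2]", "U02:%x[0,2]", "U03:%x[1,1]", "U04:%x[0,0]"]
first_position = [expand(tpl, rows, 0) for tpl in templates]
```

At position 0 (the constituent 掃) the four templates expand to the values tested in the first group of rules above: a null previous part of speech, the current part of speech v, the next length 2, and the constituent itself.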

In the above, the unknown words in the final text tagged with the part-of-speech tagging model 13 are re-tagged using the generated unknown word estimation model 20. It is also possible, however, to use the unknown word estimation model 20 to re-tag the unknown words in the text tagged by the part-of-speech tagging model 13 at the current level. In other words, the part-of-speech tagging result of the current level can be corrected first and then used as feature data for the next level.

Although Chinese text has been used above as an example to describe the embodiments of the present invention, the present invention is clearly applicable to part-of-speech tagging in any language, such as English or Japanese.

The foregoing description presents only preferred embodiments of the present invention and is not intended to limit the invention. Those skilled in the art will understand that any modification or substitution may be made to these embodiments without departing from the scope and spirit of the invention as defined by the appended claims.

Furthermore, some or all of the above embodiments may also be described as in the following appendices, without being limited to them.

(Appendix 1)
A part-of-speech tagging system comprising:
a part-of-speech tagging model training device that trains a part-of-speech tagging model level by level and node by node, based on a part-of-speech hierarchy tree, using first tagged text in a part-of-speech tag training set; and
a part-of-speech tagging device that tags the parts of speech of text to be tagged using the trained part-of-speech tagging model.

(Appendix 2)
The part-of-speech tagging system according to Appendix 1, wherein the part-of-speech tagging model training device comprises:
a CRF model training corpus construction unit that constructs a CRF model training corpus by tagging, level by level and node by node and based on the part-of-speech hierarchy tree, the first tagged text in the part-of-speech tag training set into second tagged text; and
a CRF model training unit that trains a CRF model for each corresponding level and node using the second tagged text tagged by the CRF model training corpus construction unit, thereby obtaining the part-of-speech tagging model.

(Appendix 3)
The part-of-speech tagging system according to Appendix 2, wherein the CRF model training corpus construction unit performs the tagging level by level and node by node by replacing each tagged part of speech in the first tagged text with the name of the subnode of the relevant node that corresponds to the position of the tagged part of speech in the part-of-speech hierarchy tree.

(Appendix 4)
The part-of-speech tagging system according to Appendix 3, wherein the CRF model training unit trains the CRF models level by level and node by node by selecting feature templates according to the following cases:
(a) when the current level is 0, the feature templates include the two neighboring words in the second text, the characters before and after the current word, and the co-occurrence frequency of the two neighboring words;
(b) when the current level is not 0, the feature templates include the feature templates selected at level 0, the two neighboring words in the second text of the immediately preceding level, the co-occurrence frequency between parts of speech, and the co-occurrence frequency between words and parts of speech.

(Appendix 5)
The part-of-speech tagging system according to Appendix 2, wherein the part-of-speech tagging device comprises:
a CRF model feature construction unit that constructs feature data level by level and node by node in order to apply the CRF models to the text to be tagged; and
a CRF part-of-speech tagging unit that tags the parts of speech of the text to be tagged level by level and node by node in accordance with the feature data constructed by the CRF model feature construction unit.

(Appendix 6)
The part-of-speech tagging system according to Appendix 5, wherein the CRF model feature construction unit constructs the feature data for the CRF models according to the following cases:
(a) when the current level is 0, the feature data is extracted from the text to be tagged and used to fill in the feature templates selected at level 0 during training of the CRF model;
(b) when the current level is not 0, the feature data extracted at level 0 is used, and feature data is additionally extracted from the second text tagged by the CRF models of the immediately preceding level.

(Appendix 7)
The part-of-speech tagging system according to Appendix 1, further comprising a part-of-speech hierarchy tree construction device that analyzes the relationships between the parts of speech of the tagged text in the part-of-speech tag training set.

(Appendix 8)
The part-of-speech tagging system according to Appendix 7, wherein the part-of-speech hierarchy tree construction device comprises:
a part-of-speech feature template selection unit that selects feature templates representing the features of parts of speech;
a feature vector construction unit that constructs feature vectors for the parts of speech in the part-of-speech tag training set in accordance with the selected feature templates;
a similarity calculation unit that calculates the similarity between parts of speech using the feature vectors; and
a clustering unit that clusters the parts of speech based on the similarity and generates the part-of-speech hierarchy tree.

(Appendix 9)
The part-of-speech tagging system according to Appendix 8, further comprising:
a test set construction device that randomly selects a set of text from the part-of-speech tag training set as a test set;
an evaluation device that evaluates the result of part-of-speech tagging performed on the text of the test set using the part-of-speech tagging model; and
an adjustment device that adjusts the part-of-speech hierarchy tree in accordance with the evaluation result.

(Appendix 10)
The part-of-speech tagging system according to Appendix 9, wherein the adjustment device adjusts the thresholds that the part-of-speech hierarchy tree construction device uses to calculate the similarity between parts of speech.

(Appendix 11)
The part-of-speech tagging system according to appendix 1 or appendix 2, further comprising:
an unknown-word part-of-speech guessing model construction device that learns word construction rules from the part-of-speech tag training set and constructs a part-of-speech guessing model for unknown words; and
an unknown-word part-of-speech correction device that tags the part of speech of an unknown word using the part-of-speech guessing model for unknown words and corrects the part of speech of the unknown word tagged using the part-of-speech tagging model.
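
A rough sketch of the unknown-word handling in appendix 11 follows. The appendix does not say what form the word construction rules take or exactly how the correction is applied, so the sketch assumes that a rule maps a word-final character sequence to its most frequent tag, and that the guess simply replaces the tagging model's output for any word absent from the training vocabulary. The function names and the `default` fallback tag are hypothetical.

```python
from collections import Counter, defaultdict


def learn_suffix_rules(tagged_sentences, max_suffix_len=3):
    """Map each word-final character sequence to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            for n in range(1, min(max_suffix_len, len(word)) + 1):
                counts[word[-n:]][tag] += 1
    return {suffix: c.most_common(1)[0][0] for suffix, c in counts.items()}


def guess_tag(word, rules, max_suffix_len=3, default="NN"):
    """Guess a tag from the longest known suffix, falling back to `default`."""
    for n in range(min(len(word), max_suffix_len), 0, -1):
        if word[-n:] in rules:
            return rules[word[-n:]]
    return default


def correct_unknown_words(words, model_tags, vocabulary, rules):
    """Keep the model's tag for known words; use the guess for unknown ones."""
    return [
        tag if word in vocabulary else guess_tag(word, rules)
        for word, tag in zip(words, model_tags)
    ]
```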

(Appendix 12)
A part-of-speech tagging method comprising:
a part-of-speech tagging model training step of training a part-of-speech tagging model for each level and each node, based on a part-of-speech hierarchy tree, using first tagged text in a part-of-speech tag training set; and
a part-of-speech tagging step of tagging the parts of speech of text to be tagged using the trained part-of-speech tagging model.

(Appendix 13)
The part-of-speech tagging method according to appendix 12, wherein the part-of-speech tagging model training step comprises:
a CRF model training corpus construction step of constructing a CRF model training corpus by re-tagging the first tagged text in the part-of-speech tag training set into second tagged text, level by level and node by node, based on the part-of-speech hierarchy tree; and
a CRF model training step of training a CRF model for each corresponding level and node, using the second tagged text produced by the CRF model training corpus construction step, to obtain the part-of-speech tagging model.

(Appendix 14)
The part-of-speech tagging method according to appendix 13, wherein the CRF model training corpus construction step includes a step of performing the tagging level by level and node by node by replacing each tagged part of speech in the first tagged text with the name of the subnode, of the node in question, that corresponds to the position of that part of speech in the part-of-speech hierarchy tree.
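
The tag replacement described in appendix 14 can be pictured with a small sketch. It assumes the hierarchy tree is stored as a dictionary from each internal node name to its list of subnodes, with the original tags as leaves. It also assumes that a token whose tag lies outside the node's subtree is labelled `None` and left to other node models; the appendix does not say how such tokens are handled. All names are hypothetical.

```python
def leaves_under(tree, node):
    """All original tags (leaves) below `node`; a leaf covers only itself."""
    children = tree.get(node, [])
    if not children:
        return {node}
    return set().union(*(leaves_under(tree, child) for child in children))


def relabel(tag, node, tree):
    """Name of the subnode of `node` whose subtree contains `tag`, else None."""
    for child in tree.get(node, []):
        if tag in leaves_under(tree, child):
            return child
    return None


def build_node_corpus(first_text, node, tree):
    """Second tagged text for one node: original tags replaced by subnode names."""
    return [
        [(word, relabel(tag, node, tree)) for word, tag in sentence]
        for sentence in first_text
    ]
```

With a tree such as `{"ROOT": ["N", "V"], "N": ["NN", "NNS"], "V": ["VBZ", "VBP"]}`, the corpus for `ROOT` carries only the coarse labels `N` and `V`, while the corpus for `N` distinguishes `NN` from `NNS`.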

(Appendix 15)
The part-of-speech tagging method according to appendix 14, wherein the CRF model training step includes a step of training the CRF models level by level and node by node by selecting feature templates according to the following cases:
(a) when the current level is 0, the feature template includes the two words before and after in the second text, the characters before and after the current word, and the co-occurrence frequency between the two words before and after;
(b) when the current level is not 0, the feature template includes the feature template selected at level 0, the two words before and after in the second text at the immediately preceding level, the co-occurrence frequency between parts of speech, and the co-occurrence frequency between words and parts of speech.
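
The level-dependent template selection in appendix 15 could be sketched as a feature extractor like the one below. Several readings are assumed rather than taken from the source: "two words before and after" is read as a window of two words on each side, the characters around the current word are approximated by its first and last characters, and at levels above 0 the neighbouring words contribute the labels they received at the previous level. The co-occurrence frequency features are omitted for brevity. The function and feature names are hypothetical.

```python
def window(seq, i, offset, pad="<pad>"):
    """Element at position i + offset, or a padding symbol outside the sequence."""
    j = i + offset
    return seq[j] if 0 <= j < len(seq) else pad


def extract_features(words, i, level, prev_level_tags=None):
    """Features for token i; levels above 0 add labels from the previous level."""
    features = {
        "w0": words[i],
        "first_char": words[i][0],
        "last_char": words[i][-1],
    }
    for offset in (-2, -1, 1, 2):
        features[f"w{offset:+d}"] = window(words, i, offset)
    if level > 0:
        if prev_level_tags is None:
            raise ValueError("levels above 0 need the previous level's tags")
        for offset in (-2, -1, 1, 2):
            features[f"t{offset:+d}"] = window(prev_level_tags, i, offset)
    return features
```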

(Appendix 16)
The part-of-speech tagging method according to appendix 13, wherein the part-of-speech tagging step comprises:
a CRF model feature construction step of constructing feature data, level by level and node by node, so that the CRF models can be applied to the text to be tagged; and
a CRF part-of-speech tagging step of tagging the parts of speech of the text to be tagged, level by level and node by node, according to the feature data constructed by the CRF model feature construction step.

(Appendix 17)
The part-of-speech tagging method according to appendix 16, wherein the CRF model feature construction step includes a step of constructing feature data for the CRF models according to the following cases:
(a) when the current level is 0, the feature data used to fill the feature template selected at level 0 during CRF model training is extracted from the text to be tagged;
(b) when the current level is not 0, the feature data extracted at level 0 is used, and feature data is also extracted from the second text tagged by the CRF model of the immediately preceding level.
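
Taken together, appendices 16 and 17 describe a top-down cascade at tagging time, which might be sketched as follows. Each trained CRF model is replaced here by a hypothetical callable `model(words, previous_level_tags)` that returns one label per token; at level 0 it receives `None` because no previous level exists. The tree is assumed to be a dictionary from internal node names to lists of subnodes. This is an illustrative sketch, not the patent's implementation.

```python
def tag_hierarchically(words, tree, node_models, root="ROOT"):
    """Refine every token's label level by level until all labels are leaves."""
    tags = [root] * len(words)
    level = 0
    while any(tag in tree for tag in tags):
        previous = list(tags)  # labels assigned at the preceding level
        for node in sorted(set(previous) & set(tree)):
            context = previous if level > 0 else None
            predicted = node_models[node](words, context)
            for i, tag in enumerate(previous):
                if tag == node:
                    if predicted[i] not in tree[node]:
                        raise ValueError(f"{predicted[i]!r} is not a subnode of {node!r}")
                    tags[i] = predicted[i]
        level += 1
    return tags
```

Each node's model only decides among that node's subnodes, so every individual model faces a much smaller label set than the full tag set.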

(Appendix 18)
The part-of-speech tagging method according to appendix 12, further comprising a part-of-speech hierarchy tree construction step of analyzing the relationships between the parts of speech of the tagged text in the part-of-speech tag training set.

(Appendix 19)
The part-of-speech tagging method according to appendix 18, wherein the part-of-speech hierarchy tree construction step comprises:
a part-of-speech feature template selection step of selecting a feature template representing features of parts of speech;
a feature vector construction step of constructing feature vectors for the parts of speech in the part-of-speech tag training set according to the selected feature template;
a similarity calculation step of calculating the similarity between parts of speech using the feature vectors; and
a clustering step of clustering the parts of speech based on the similarity and generating the part-of-speech hierarchy tree.

(Appendix 20)
The part-of-speech tagging method according to appendix 19, further comprising:
a test set construction step of randomly selecting a set of texts from the part-of-speech tag training set as a test set;
an evaluation step of using the part-of-speech tagging model to evaluate the result of part-of-speech tagging on the tagged text from the test set; and
an adjustment step of adjusting the part-of-speech hierarchy tree according to the evaluation result.

(Appendix 21)
The part-of-speech tagging method according to appendix 20, wherein the adjustment step includes a step of adjusting a threshold value that the part-of-speech hierarchy tree construction step uses to calculate the similarity between parts of speech.

(Appendix 22)
The part-of-speech tagging method according to appendix 12 or appendix 13, further comprising:
an unknown-word part-of-speech guessing model construction step of learning word construction rules from the part-of-speech tag training set and constructing a part-of-speech guessing model for unknown words; and
an unknown-word part-of-speech correction step of tagging the part of speech of an unknown word using the part-of-speech guessing model for unknown words and correcting the part of speech of the unknown word tagged using the part-of-speech tagging model.

(Appendix 23)
A training device for a part-of-speech tagging model, comprising:
a CRF model training corpus construction unit that constructs a CRF model training corpus by re-tagging first tagged text in a part-of-speech tag training set into second text, for each level and each node, based on a part-of-speech hierarchy tree; and
a CRF model training unit that trains an individual CRF model for each level and each node, using the second text tagged by the CRF model training corpus construction unit, to obtain a part-of-speech tagging model.

(Appendix 24)
A training method for a part-of-speech tagging model, comprising:
a CRF model training corpus construction step of constructing a CRF model training corpus by re-tagging first tagged text in a part-of-speech tag training set into second text, for each level and each node, based on a part-of-speech hierarchy tree; and
a CRF model training step of training an individual CRF model for each level and each node, using the second text tagged by the CRF model training corpus construction unit, to obtain a part-of-speech tagging model.

10: Part-of-speech tag training set
12: Part-of-speech tagging model training device
13: Part-of-speech tagging model
14: Part-of-speech hierarchy tree construction device
15: Part-of-speech hierarchy tree
22: Part-of-speech tagging device
140: Part-of-speech feature template selection unit
141: Feature vector construction unit
142: Similarity calculation unit
143: Clustering unit
120: Logic circuit
121: CRF model training corpus construction unit
122: CRF model training unit
220: CRF model feature construction unit
221: CRF part-of-speech tagging unit
222: Logic circuit
16: Evaluation device
17: Adjustment device
18: Test set construction device
19: Unknown-word part-of-speech guessing model construction device
20: Unknown-word part-of-speech guessing model
21: Unknown-word part-of-speech correction device

Claims

A part-of-speech tagging system comprising:
a part-of-speech tagging model training device that trains a part-of-speech tagging model for each level and each node, based on a part-of-speech hierarchy tree, using first tagged text in a part-of-speech tag training set; and
a part-of-speech tagging device that tags the parts of speech of text to be tagged using the trained part-of-speech tagging model.

The part-of-speech tagging system according to claim 1, wherein the part-of-speech tagging model training device comprises:
a CRF model training corpus construction unit that constructs a CRF model training corpus by re-tagging the first tagged text in the part-of-speech tag training set into second tagged text, level by level and node by node, based on the part-of-speech hierarchy tree; and
a CRF model training unit that trains a CRF model for each corresponding level and node, using the second tagged text produced by the CRF model training corpus construction unit, to obtain the part-of-speech tagging model.

The part-of-speech tagging system, wherein the part-of-speech tagging device comprises:
a CRF model feature construction unit that constructs feature data, level by level and node by node, so that the CRF models can be applied to the text to be tagged; and
a CRF part-of-speech tagging unit that tags the parts of speech of the text to be tagged, level by level and node by node, according to the feature data constructed by the CRF model feature construction unit.

The part-of-speech tagging system according to claim 1 or 2, further comprising:
an unknown-word part-of-speech guessing model construction device that learns word construction rules from the part-of-speech tag training set and constructs a part-of-speech guessing model for unknown words; and
an unknown-word part-of-speech correction device that tags the part of speech of an unknown word using the part-of-speech guessing model for unknown words and corrects the part of speech of the unknown word tagged using the part-of-speech tagging model.

A part-of-speech tagging method comprising:
a part-of-speech tagging model training step of training a part-of-speech tagging model for each level and each node, based on a part-of-speech hierarchy tree, using first tagged text in a part-of-speech tag training set; and
a part-of-speech tagging step of tagging the parts of speech of text to be tagged using the trained part-of-speech tagging model.

The part-of-speech tagging method according to claim 5, wherein the part-of-speech tagging model training step comprises:
a CRF model training corpus construction step of constructing a CRF model training corpus by re-tagging the first tagged text in the part-of-speech tag training set into second tagged text, level by level and node by node, based on the part-of-speech hierarchy tree; and
a CRF model training step of training a CRF model for each corresponding level and node, using the second tagged text produced by the CRF model training corpus construction step, to obtain the part-of-speech tagging model.

The part-of-speech tagging method according to claim 6, wherein the part-of-speech tagging step comprises:
a CRF model feature construction step of constructing feature data, level by level and node by node, so that the CRF models can be applied to the text to be tagged; and
a CRF part-of-speech tagging step of tagging the parts of speech of the text to be tagged, level by level and node by node, according to the feature data constructed by the CRF model feature construction step.

The part-of-speech tagging method according to claim 5 or 6, further comprising:
an unknown-word part-of-speech guessing model construction step of learning word construction rules from the part-of-speech tag training set and constructing a part-of-speech guessing model for unknown words; and
an unknown-word part-of-speech correction step of tagging the part of speech of an unknown word using the part-of-speech guessing model for unknown words and correcting the part of speech of the unknown word tagged using the part-of-speech tagging model.

A training device for a part-of-speech tagging model, comprising:
a CRF model training corpus construction unit that constructs a CRF model training corpus by re-tagging first tagged text in a part-of-speech tag training set into second text, for each level and each node, based on a part-of-speech hierarchy tree; and
a CRF model training unit that trains an individual CRF model for each level and each node, using the second text tagged by the CRF model training corpus construction unit, to obtain a part-of-speech tagging model.

A training method for a part-of-speech tagging model, comprising:
a CRF model training corpus construction step of constructing a CRF model training corpus by re-tagging first tagged text in a part-of-speech tag training set into second text, for each level and each node, based on a part-of-speech hierarchy tree; and
a CRF model training step of training an individual CRF model for each level and each node, using the second text tagged by the CRF model training corpus construction unit, to obtain a part-of-speech tagging model.