JP2008140117A

JP2008140117A - Apparatus for segmenting Chinese character sequence to Chinese word sequence

Info

Publication number: JP2008140117A
Application number: JP2006325457A
Authority: JP
Inventors: Zuikyo Cho; 瑞強張; Eiichiro Sumida; 英一郎隅田
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2006-12-01
Filing date: 2006-12-01
Publication date: 2008-06-19

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus for segmenting a Chinese character sequence into an appropriate word sequence.

SOLUTION: The apparatus comprises a Chinese subword list 64 and a statistical probability model 66 of sequences of IOB tags assigned to Chinese subwords. The apparatus further comprises a subword-based IOB tagging module 88 for segmenting a Chinese sentence 80 into a first Chinese word sequence by maximum likelihood estimation using the subword list 64 and the probability model 66. The multiple-subword words in the first Chinese word sequence are segmented into subwords, each labeled with an IOB tag according to the segmentation. The words in the Chinese subword list 64 are treated as subwords by the subword-based IOB tagging module 88 in segmenting the Chinese sentence 80.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to Chinese word segmentation, and more particularly to segmentation of Chinese sentences based on subword tagging and a confidence measure.

In this specification, segmenting a sentence into a sequence of words is referred to as "word segmentation". Word segmentation is a prerequisite for processing natural languages that are written without spaces between words. Chinese is one such language.

Recent Chinese word segmentation widely uses a character-based "IOB" tagging approach (Non-Patent Documents 3, 4, and 5). In this scheme, each character of a word is labeled "B" if it is the first character of a multi-character word, "O" if it functions as an independent word, and "I" otherwise. For example, 全北京市 ("all of Beijing city") becomes 全/O 北/B 京/I 市/I.
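The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the input is assumed to be a sentence that has already been segmented into words.

```python
def char_iob_tags(words):
    """Label every character of an already segmented sentence with I, O or B."""
    tagged = []
    for word in words:
        for pos, ch in enumerate(word):
            if len(word) == 1:
                tag = "O"   # single character acting as an independent word
            elif pos == 0:
                tag = "B"   # first character of a multi-character word
            else:
                tag = "I"   # any later character of a multi-character word
            tagged.append((ch, tag))
    return tagged


def words_from_tags(tagged):
    """Inverse mapping: rebuild the word sequence from (unit, tag) pairs."""
    words = []
    for unit, tag in tagged:
        if tag == "I" and words:
            words[-1] += unit   # continue the word opened by a preceding B
        else:
            words.append(unit)  # B or O starts a new word
    return words
```

Because the tags can be mapped back to words, predicting a tag sequence is equivalent to predicting a segmentation.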

So far, all existing implementations of Chinese word segmentation have been found to use character-based IOB tagging.

Non-Patent Document 1: Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju, Korea.
Non-Patent Document 2: John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. of ICML-2001, pages 591-598.
Non-Patent Document 3: Fuchun Peng and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proc. of Coling-2004, pages 562-568, Geneva, Switzerland.
Non-Patent Document 4: Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for Sighan bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju, Korea.
Non-Patent Document 5: Nianwen Xue and Libin Shen. 2003. Chinese word segmentation as LMR tagging. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing.
Non-Patent Document 6: Frederick Jelinek. 1998. Statistical methods for speech recognition. The MIT Press.

The IOB tagging approach has a clear weakness: it achieves a high out-of-vocabulary (OOV) rate (R-oov) at the cost of a very low in-vocabulary (IV) rate (R-iv). In the closed-test results of the 2005 bakeoff reported in Non-Patent Document 1, the system of Tseng et al. (Non-Patent Document 4), which used conditional random fields (CRF) for IOB tagging, achieved a very high R-oov on all four corpora used, but its R-iv was low. OOV recognition is very important in word segmentation, but a high IV rate is also desirable.

Furthermore, conventional methods do not always segment or tag multi-character words appropriately.

Accordingly, one object of the present invention is to provide an apparatus that segments a Chinese character sequence into a Chinese word sequence more appropriately.

Another object of the present invention is to provide an apparatus that segments a Chinese character sequence into a Chinese word sequence more appropriately, in which R-oov and R-iv can be varied so that the weakness is reduced and an optimal trade-off can be found.

An apparatus for segmenting a Chinese character sequence into a Chinese word sequence according to a first aspect of the present invention includes a first storage unit for storing a Chinese subword list that enumerates Chinese characters and Chinese multi-character words, and a second storage unit for storing a statistical probability model of sequences of first tags assigned to Chinese subwords. A first tag indicates an independent word, the first subword of a multi-character word, or otherwise. The apparatus further includes subword-based segmenting means for segmenting a Chinese character sequence into a first Chinese word sequence by maximum likelihood estimation using the subword list and the statistical probability model. Each multiple-subword word in the first Chinese word sequence is segmented into subwords, each labeled with a first tag according to the segmentation. The words in the Chinese subword list are treated as subwords when the subword-based segmenting means segments the Chinese character sequence.

Preferably, the subword-based segmenting means outputs a predefined confidence probability for each of the first tags, and the apparatus further includes: a third storage unit for storing a Chinese dictionary that enumerates Chinese characters and Chinese words; a fourth storage unit for storing a statistical language model of Chinese; and dictionary-based word segmenting means for segmenting the input Chinese character sequence into a second Chinese word sequence by maximum likelihood estimation using the dictionary and the language model, and for attaching a second tag to each character in the second Chinese word sequence. A second tag indicates a character that functions as an independent word, as the first character of a multi-character word, or otherwise. The apparatus further includes determining means for determining the tag to be assigned to each subword in the first Chinese word sequence as a function of the first and second tags both assigned to the subword and of the confidence probability of the subword.

Experiments confirmed that, with this configuration, R-iv and R-oov can be optimized by changing the function used in the tag determining means.

More preferably, the subword-based segmenting means outputs a predefined confidence probability CM_iob(t|w_i) for each first tag t of the i-th word w_i according to the following equation:

Here, the numerator is the sum over all observation sequences in which the word w_i is labeled t, W is the segmented Chinese word sequence, and T = t_0 t_1 ... t_M is the tag sequence assigned to the word sequence W.

With this configuration, the apparatus may be trained using the CRF approach, which is known to be effective for character and word segmentation.

Alternatively, the subword-based segmenting means outputs a predefined confidence probability CM_iob(t|w) for each first tag t of a word w according to the following equation:

Here, h_i is the i-th segmentation hypothesis.

With this configuration, the apparatus may be trained using a maximum entropy (MaxEnt) approach, which can incorporate many kinds of features to improve tagging accuracy.

[First Embodiment]
<Structure>
The following proposes subword-based IOB tagging, which assigns tags not only to single Chinese characters but also to a predefined lexicon subset consisting of the most frequent multi-character words. If only Chinese characters are used, subword-based IOB tagging is identical to character-based tagging. Taking the same example as above, 全北京市 becomes 全/O 北京/B 市/I under subword-based tagging, where 北京 (Beijing) is tagged as a single unit.

-Chinese word segmentation process of this embodiment-
The word segmentation process 20 of this embodiment is illustrated in FIG. 1. It includes three parts: dictionary-based N-gram word segmentation (hereinafter "dictionary-based segmentation") 32 for segmenting the IV words of an input sentence 30; subword-based IOB tagging 34 by CRF for recognizing OOV words in the output of the dictionary-based segmentation; and confidence-based word segmentation 36 for merging the results of the dictionary-based segmentation 32 and the subword-based IOB tagging 34 and outputting a segmented Chinese word sequence 38. FIG. 1 also gives an example showing the result of each step.

FIG. 2 is a block diagram of the Chinese word segmentation apparatus 50 of this embodiment. Referring to FIG. 2, the Chinese word segmentation apparatus 50 includes training data 60 containing segmented Chinese sentences, and a model training module 62 for generating a subword list 64 and a probability model 66, both of which are used in word segmentation by the apparatus 50.

The Chinese word segmentation apparatus 50 further includes: a dictionary-based word segmentation module 86 for segmenting an input Chinese sentence 80; a Chinese dictionary 82 and a Chinese statistical language model 84; a subword-based IOB tagging module 88 for re-segmenting the Chinese sentence 80 into a sequence of words and tagging each word with a subword-based IOB tag T_iob and the corresponding confidence measure CM_iob(t_iob|w); and a confidence-measure-based segmentation module 90 for finally segmenting the Chinese sentence 80 using the results of the dictionary-based word segmentation module 86 and the subword-based IOB tagging module 88, and outputting a segmentation result 92.

-Training of the probability model-
Referring to FIG. 3, the model training module 62 includes: a frequency count module 110 for counting the frequency of each word in the training data 60 and creating a word list; a sort module 112 for sorting the word list in descending order of frequency to create an ordered list; a storage unit for storing the ordered list 114; and a selection module 116 for outputting the subword list 64 by selecting the top 2000 multi-character words and all single-character words from the ordered list 114.

The model training module 62 further includes an IOB tagging and segmentation module 120 for segmenting the sentences in the training data 60 and tagging each word with an IOB tag, using the subword list 64 as the segmentation dictionary. The output of the IOB tagging and segmentation module 120 is subword training data 122, which is stored in a storage unit.

The model training module 62 further includes a training module 124 for training the probability model 66 using the subword training data 122.

Training the probability model 66 used in the subword-based IOB tagging module 88 takes several steps. First, the frequency count module 110 and the IOB tagging and segmentation module 120 extract from the training data a word list sorted in descending order of the counts in the training data. The selection module 116 selects all single-character words and the top 2000 multi-character words as the lexicon subset for IOB tagging. These words are listed in the subword list 64.
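The first step (count, sort, select) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `build_subword_list` and its `top_n` parameter are hypothetical names, and the training data is assumed to be a list of sentences, each a list of words.

```python
from collections import Counter


def build_subword_list(segmented_sentences, top_n=2000):
    """Return all single-character words plus the top_n most frequent
    multi-character words found in the segmented training sentences."""
    counts = Counter(word for sent in segmented_sentences for word in sent)
    ordered = [word for word, _ in counts.most_common()]  # descending frequency
    single = [word for word in ordered if len(word) == 1]
    multi = [word for word in ordered if len(word) > 1][:top_n]
    return set(single) | set(multi)
```

With `top_n=0` the list contains single characters only, which corresponds to the character-based special case.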

If the subset consists only of single-character words, the subword-based IOB tagging module 88 reduces to a character-based IOB tagger. The words in the subset are regarded as the subwords for IOB tagging in the subword-based IOB tagging module 88.

Second, the IOB tagging and segmentation module 120 re-segments the words in the training data into subwords belonging to the subset and assigns IOB tags to them. For a character-based IOB tagger there is only one possible re-segmentation. With the IOB tagging and segmentation of this embodiment, however, subword-based IOB tagging offers multiple choices. For example, 北京市 ("Beijing city") can be segmented as 北京市/O, as 北京/B 市/I, or as 北/B 京/I 市/I. This embodiment resolves the ambiguity with forward maximal match (FMM). Backward maximal match (BMM) or other approaches could of course be applied instead. No comparative experiments were performed, because the slight differences among these approaches are not expected to have significant consequences for the subword-based approach.
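A minimal sketch of FMM re-segmentation followed by subword tagging is given below. The function names are hypothetical, and the fallback that always accepts a single character (even one missing from the subword list) is an assumption made so that the loop terminates.

```python
def fmm_split(word, subwords):
    """Forward maximal match: repeatedly take the longest listed subword."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Assumed fallback: a single character is always accepted.
            if piece in subwords or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


def subword_iob_tags(words, subwords):
    """Re-segment each word into subwords and attach I, O, B tags."""
    tagged = []
    for word in words:
        pieces = fmm_split(word, subwords)
        if len(pieces) == 1:
            tagged.append((pieces[0], "O"))
        else:
            tagged.append((pieces[0], "B"))
            tagged.extend((piece, "I") for piece in pieces[1:])
    return tagged
```

Depending on the contents of the subword list, the same word 北京市 yields each of the three taggings mentioned above.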

In the third step, the training module 124 trains the probability model 66 on the training data using the CRF approach (Non-Patent Document 2). The "CRF++" package was downloaded from http://www.chasen.org/~taku/software and used for this purpose.

According to the CRF, the probability of an IOB tag sequence T = t_0 t_1 ... t_M given a word sequence W = w_0 w_1 ... w_M is defined as follows.
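The equation itself is not reproduced in this text. As a hedged reconstruction, the standard linear-chain CRF of Non-Patent Document 2, written with the feature functions f_k and g_k and the parameters λ_k and μ_k named in the next paragraph, has the form below. The arguments of the feature functions and the summation range over i are assumptions.

```latex
P(T \mid W) = \frac{1}{Z}
  \exp\!\left( \sum_{i=1}^{M} \left(
      \sum_{k} \lambda_k\, f_k(t_{i-1}, t_i, W)
    + \sum_{k} \mu_k\, g_k(t_i, W) \right) \right),
\qquad
Z = \sum_{T'} \exp\!\left( \sum_{i=1}^{M} \left(
      \sum_{k} \lambda_k\, f_k(t'_{i-1}, t'_i, W)
    + \sum_{k} \mu_k\, g_k(t'_i, W) \right) \right)
```

Z normalizes over all candidate tag sequences T' of the same length, so that the probabilities sum to one.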

Here, f_k is called a bigram feature function, because these features trigger on the previous observation t_{i-1} and the current observation t_i simultaneously. Further, g_k is called a unigram feature function, because these features trigger on the current observation t_i only. λ_k and μ_k are the model parameters corresponding to the feature functions f_k and g_k, respectively.

The model parameters are trained by maximizing the log-likelihood of the training data using the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) gradient descent optimization method. To overcome overfitting, a Gaussian prior is imposed in training.

The unigram features used in this embodiment include the following types.

w_0, w_{-1}, w_1, w_{-2}, w_2, w_0 w_{-1}, w_0 w_1, w_{-1} w_1, w_{-2} w_{-1}, w_2 w_0
Here, w stands for "word". The subscripts are position indices: 0 denotes the current word; -1 and -2 denote the first and second words to the left; 1 and 2 denote the first and second words to the right.
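The ten templates listed above can be instantiated at a given position as sketched below. The feature string format and the `<PAD>` symbol used outside the sentence boundaries are assumptions for illustration; the source does not specify how boundaries are handled.

```python
# Offsets of the ten unigram templates, in the order listed in the text.
TEMPLATES = [(0,), (-1,), (1,), (-2,), (2,),
             (0, -1), (0, 1), (-1, 1), (-2, -1), (2, 0)]


def unigram_features(words, i, pad="<PAD>"):
    """Instantiate every template at position i of a subword sequence."""
    def w(offset):
        j = i + offset
        return words[j] if 0 <= j < len(words) else pad

    feats = []
    for offsets in TEMPLATES:
        name = "".join(f"w[{o}]" for o in offsets)
        value = "/".join(w(o) for o in offsets)
        feats.append(f"{name}={value}")
    return feats
```

For the subword sequence 全 北京 市 and position 1, the first feature is `w[0]=北京` and the sixth is `w[0]w[-1]=北京/全`.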

For the bigram features, only the combination of the previous and current observations, t_{-1} t_0, is used.

For feature selection, only the absolute count of each feature in the training data is used. A cutoff value is defined for each feature type, and the features that occur more often than the cutoff are selected.

Training by the training module 124 uses the forward-backward algorithm.

-Dictionary-based word segmentation-
The dictionary-based approach is well known. Note, however, that although it yields a higher R-iv, it cannot detect OOV words. It is combined with an N-gram language model (LM) to resolve segmentation ambiguity. Given a Chinese character sequence C = c_0 c_1 c_2 ... c_N, the word segmentation problem is formulated as finding the word sequence W = w_{t_0} w_{t_1} w_{t_2} ... w_{t_M} that satisfies the following.
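The equation is not reproduced in this text. A reconstruction consistent with the surrounding description (Bayes' rule, a language model P(W), and a product of Kronecker deltas for P(C|W)) is given below as a sketch. The index bookkeeping, in which t_j marks the position of the last character of the j-th word, is an assumption.

```latex
\begin{aligned}
W &= \arg\max_{W} P(W \mid C)
   = \arg\max_{W} P(W)\, P(C \mid W) \\
  &= \arg\max_{w_{t_0} w_{t_1} \cdots w_{t_M}}
     P(w_{t_0} w_{t_1} \cdots w_{t_M})\;
     \delta(c_0 \cdots c_{t_0},\, w_{t_0})\;
     \delta(c_{t_0+1} \cdots c_{t_1},\, w_{t_1}) \cdots
     \delta(c_{t_{M-1}+1} \cdots c_{t_M},\, w_{t_M})
\end{aligned}
```

The denominator P(C) of Bayes' rule is dropped because it does not depend on W.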

Bayes' rule was applied in deriving the expression above.

Because the word sequence must be consistent with the character sequence, P(C|W) is expanded into a product of Kronecker delta functions δ(u, v), each equal to 1 if both arguments are equal and 0 otherwise. P(w_{t_0} w_{t_1} ... w_{t_M}) is a language model that can be expanded by the chain rule.

When a trigram LM is used, it becomes as follows.
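The trigram expression is likewise missing from this text. Under the usual chain-rule approximation, and assuming that history positions before the start of the sentence are treated as sentence-start padding, the language model term would expand as:

```latex
P(w_0 w_1 \cdots w_M) \approx \prod_{i=0}^{M} P(w_i \mid w_{i-2}\, w_{i-1})
```

This is a standard trigram factorization offered as a sketch, not the patent's verbatim formula.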

Here, w_i is shorthand for w_{t_i}.

Equation 2 represents the dictionary-based word segmentation process. All IV words are found by lexicon (dictionary) look-up, and the word sequences are evaluated by the LM. A beam search was used instead of a Viterbi search (see Jelinek, Non-Patent Document 6, 1998) to decode the optimal word sequence, because beam search was found to speed up decoding. All hypotheses were scored with the N-gram LM, and the one with the highest LM score was taken as the final output.
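The decoding idea can be sketched as a position-wise beam search over dictionary-consistent segmentations. This is an illustration under simplifying assumptions: a toy unigram LM stands in for the N-gram LM of the text, and the function name, `max_word_len`, and `beam_size` are hypothetical.

```python
import math


def beam_segment(chars, lm_logprob, max_word_len=4, beam_size=5):
    """Return the best-scoring segmentation of `chars` into dictionary words,
    or None if no segmentation exists. `lm_logprob` maps each dictionary
    word to a log probability (toy unigram LM, an assumption)."""
    n = len(chars)
    # beams[k] collects (score, words) hypotheses that cover chars[:k].
    beams = {0: [(0.0, [])]}
    for k in range(n):
        # Keep only the beam_size best hypotheses ending at position k.
        hyps = sorted(beams.get(k, []), key=lambda h: h[0], reverse=True)[:beam_size]
        for score, words in hyps:
            for end in range(k + 1, min(n, k + max_word_len) + 1):
                word = chars[k:end]
                if word in lm_logprob:  # only in-vocabulary words are allowed
                    beams.setdefault(end, []).append(
                        (score + lm_logprob[word], words + [word]))
    final = beams.get(n, [])
    return max(final, key=lambda h: h[0])[1] if final else None
```

A sentence containing a character sequence absent from the dictionary has no hypothesis at all in this sketch, which mirrors the remark that the dictionary-based approach cannot detect OOV words.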

FIG. 4 is a schematic diagram showing details of the dictionary-based word segmentation module 86. Referring to FIG. 4, the module 86 includes: a hypothesis generation module 140 for generating, by referring to the Chinese dictionary 82, all segmentation hypotheses possible for the Chinese sentence 80; a storage unit 142 for storing the hypotheses; a likelihood calculation module 144 for calculating the likelihood of each hypothesis stored in the storage unit 142 using the statistical language model 84; and a maximum likelihood selection module 146 for selecting the hypothesis with the highest likelihood (LM score).

Note that FIG. 4 is schematic: the software that implements the functions of the dictionary-based word segmentation module 86 has a more sophisticated structure that uses the beam search algorithm to run faster with less computation.

-IOB tagging by CRF using subwords-
Referring to FIG. 5, the subword-based IOB tagging module 88 includes: a hypothesis generation module 160 for generating, by referring to the subword list 64, all word segmentations possible for the Chinese sentence 80; a storage unit 162 for storing the hypotheses; a likelihood calculation module 164 for calculating the likelihood of each hypothesis stored in the storage unit 162 using the probability model 66; and a maximum likelihood selection module 166 for selecting the hypothesis with the highest likelihood. The beam search algorithm is used here.

Note that FIG. 5 is schematic: the software that implements the functions of the subword-based IOB tagging module 88 has a more sophisticated structure so that it runs faster with less computation.

-Confidence-dependent word segmentation-
Before moving to the confidence-based segmentation 36 (see FIG. 1), two segmentation results have been produced: one by the dictionary-based approach and one by IOB tagging. Both sets of tags are assigned to each word of the Chinese sentence 80. Neither result is perfect, however. The dictionary-based segmentation 32 yields a high R-iv but a low R-oov, whereas the IOB tagging 34 yields the opposite.

This embodiment introduces a confidence measure approach to combine the two results. A confidence measure CM_iob(t|w) is defined to measure the confidence of the results produced by the IOB tagging 34, using the results of the dictionary-based segmentation 32. The confidence measure comes from two sources: the IOB tagging 34 and the dictionary-based word segmentation 32. Its calculation is defined as follows.
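The equation (referred to later in the text as Equation 3) is not reproduced here. Given that α is described as a weight between the two sources and that the dictionary contribution is a Kronecker delta, a linear interpolation of the following form is a plausible reconstruction. The symbol CM(t_iob|w) for the combined measure is an assumption.

```latex
CM(t_{iob} \mid w) = \alpha\, CM_{iob}(t_{iob} \mid w)
                   + (1 - \alpha)\, \delta(t_w,\, t_{iob})
```

With this form, the combined value lies between 0 and 1 whenever CM_iob does.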

Here, t_iob is the IOB tag assigned to the word w by the IOB tagging 34, and t_w is the prior IOB tag determined from the result of the dictionary-based segmentation 32. After the dictionary-based segmentation 32, the words are re-segmented into subwords by FMM within the subword-based IOB tagging module 88 and then passed to the IOB tagging 34. Each subword is given a prior IOB tag t_w. CM_iob(t|w), the confidence probability derived in the IOB tagging process, is defined as follows.
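This equation is also missing from the text. One form consistent with the description of its numerator, offered as a reconstruction and not as the patent's verbatim formula, normalizes over all tag sequences; the denominator is an assumption.

```latex
CM_{iob}(t \mid w_i) =
  \frac{\displaystyle\sum_{T = t_0 t_1 \cdots t_M,\; t_i = t} P(T \mid W)}
       {\displaystyle\sum_{T = t_0 t_1 \cdots t_M} P(T \mid W)}
```

In other words, it is the marginal probability that position i carries tag t.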

Here, the numerator is the sum over all observation sequences in which the word w_i is labeled t.
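The sum described above can be illustrated by brute-force enumeration. This sketch is for exposition only: a practical CRF toolkit would obtain the same marginal with the forward-backward algorithm, and the `score` callback (an unnormalized log score of a whole tag sequence) is a hypothetical interface.

```python
import itertools
import math

TAGS = ("I", "O", "B")


def tag_confidence(score, length, i, t):
    """Marginal probability that position i carries tag t.
    `score(seq)` returns an unnormalized log score of a full tag sequence."""
    total = 0.0
    matching = 0.0
    for seq in itertools.product(TAGS, repeat=length):
        weight = math.exp(score(seq))
        total += weight            # sum over all tag sequences
        if seq[i] == t:
            matching += weight     # sum over sequences with t at position i
    return matching / total
```

Enumeration costs 3^length evaluations, so it is suitable only for checking small examples.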

The term δ(t_w, t_iob) indicates the contribution of the dictionary-based segmentation. It is a Kronecker delta, equal to 1 if t_w and t_iob are identical and 0 otherwise.

In Equation 3, α is the weight between the IOB tagging 34 and the dictionary-based segmentation 32. A value of 0.7 was determined empirically for α.

Equation 3 re-evaluates the results of IOB tagging. A confidence measure threshold T_TH is defined for making a decision based on the value. If the value is lower than T_TH, the IOB tag is rejected and the dictionary-based segmentation is used; otherwise, the IOB tagging segmentation is used.

New OOV words are generated in this way. There are two extreme cases: T_TH = 0 corresponds to pure IOB tagging, and T_TH = 1 corresponds to the dictionary-based approach. In practical applications, a satisfactory trade-off between R-iv and R-oov can be found by tuning the confidence threshold.
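The decision rule can be sketched as follows. The function names are hypothetical, and the linear interpolation used for the combined measure is an assumed form of Equation 3 based on the description of α and the Kronecker delta, not a formula reproduced from the source.

```python
def combined_confidence(cm_iob, t_iob, t_w, alpha=0.7):
    """Assumed form of Equation 3: interpolate the CRF confidence of the
    IOB tag with the agreement between the IOB tag and the dictionary tag."""
    delta = 1.0 if t_iob == t_w else 0.0   # Kronecker delta
    return alpha * cm_iob + (1.0 - alpha) * delta


def choose_tag(cm_iob, t_iob, t_w, threshold, alpha=0.7):
    """Reject the IOB tag in favor of the dictionary tag when the combined
    confidence falls below the threshold T_TH."""
    cm = combined_confidence(cm_iob, t_iob, t_w, alpha)
    return t_w if cm < threshold else t_iob
```

With `threshold=0.0` the IOB tag is never rejected, and with `threshold=1.0` any disagreement is resolved in favor of the dictionary tag, which matches the two extreme cases described above.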

FIG. 6 shows the configuration of the confidence-measure-based segmentation module 90. Referring to FIG. 6, the module 90 includes: a comparison module 180 for comparing the tags T_iob and T_w for each word of the Chinese sentence 80 and outputting the comparison result; a confidence measure calculation module 184 for calculating the confidence measure by applying Equation 3 to the output of the comparison module 180; a storage unit 182 for α used in Equation 3; a tag selection module 186 for selecting either tag T_iob or T_w for each word, depending on a comparison between the output of the confidence measure calculation module 184 and the threshold T_TH; and a storage unit 188 for storing the threshold T_TH.

<Operation>
Referring to FIGS. 1 to 6, the Chinese word segmentation apparatus 50 of this embodiment operates as follows.

The Chinese word segmentation apparatus 50 operates in two stages: training and segmentation.

-Training stage-
In the training stage, the training data 60 shown in FIGS. 2 and 3 is prepared in advance. The frequency count module 110 counts the occurrences of each word in the training data 60 and outputs a word list. Each word in the list has its frequency assigned to it.

ソートモジュール１１２は単語リストを頻度の降順でソートする。結果として得られる順序付きリスト１１４は記憶部に記憶される。 The sort module 112 sorts the word list in descending order of frequency. The resulting ordered list 114 is stored in the storage unit.

選択モジュール１１６は全ての一文字単語と上位２０００個の複数文字単語とを順序付きリスト１１４から選択し、サブワードリスト６４を生成する。 The selection module 116 selects all single-character words and the top 2000 multi-character words from the ordered list 114 and generates the subword list 64.
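
As a non-limiting illustration, the counting, sorting, and selection steps described above could be sketched in Python as follows. The function name `build_subword_list`, the alphabetical tie-breaking, and the toy corpus are assumptions made for this sketch and are not part of the embodiment.

```python
from collections import Counter

def build_subword_list(segmented_sentences, top_n=2000):
    """Return all single-character words plus the top_n most frequent
    multi-character words found in a word-segmented training corpus."""
    # Frequency counting (corresponds to the role of module 110).
    counts = Counter(w for sent in segmented_sentences for w in sent)
    # Descending-frequency ordering (module 112); ties broken alphabetically
    # only to make the sketch deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Selection (module 116).
    singles = [w for w, _ in ordered if len(w) == 1]
    multis = [w for w, _ in ordered if len(w) > 1][:top_n]
    return set(singles) | set(multis)

# Toy corpus: each sentence is a list of words.
corpus = [["全", "北京市"], ["北京市", "很", "大"], ["上海市", "很", "大"]]
subwords = build_subword_list(corpus, top_n=1)
```

With `top_n=1`, only the most frequent multi-character word of the toy corpus is kept alongside every single-character word.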

ＩＯＢタグ付け及びセグメント化モジュール１２０はトレーニングデータ６０中の文の各々を、サブワードリスト６４をセグメント化のためのレキシコンとして用いて再セグメント化する。結果として得られるサブワードトレーニングデータ１２２は記憶部に記憶される。トレーニングモジュール１２４はサブワードトレーニングデータ１２２を用いて確率モデル６６をトレーニングする。このトレーニングにより、確率モデル６６は、上述のとおり、Ｌ−ＢＦＧＳ勾配効果最適化法を用いてトレーニングデータの対数尤度が最大化されるようにトレーニングされる。 The IOB tagging and segmentation module 120 re-segments each sentence in the training data 60 using the subword list 64 as a lexicon for segmentation. The resulting subword training data 122 is stored in the storage unit. Training module 124 trains probability model 66 using subword training data 122. By this training, the probability model 66 is trained so that the log likelihood of the training data is maximized using the L-BFGS gradient effect optimization method as described above.
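
The re-segmentation of training words into tagged subwords can be sketched as follows. This is a minimal sketch assuming forward maximum matching (FMM, the method named later in the experiments), a maximum subword length of four characters, and that any single character is accepted as a subword; the function names `fmm` and `to_subword_iob` are hypothetical.

```python
def fmm(word, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest
    piece that is in the lexicon, falling back to one character."""
    pieces, i = [], 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if n == 1 or piece in lexicon:
                pieces.append(piece)
                i += n
                break
    return pieces

def to_subword_iob(words, lexicon):
    """Split each word into subwords and label them O (independent word),
    B (first subword of a longer word), or I (otherwise)."""
    tagged = []
    for word in words:
        pieces = fmm(word, lexicon)
        if len(pieces) == 1:
            tagged.append((pieces[0], "O"))
        else:
            tagged.append((pieces[0], "B"))
            tagged.extend((p, "I") for p in pieces[1:])
    return tagged

# Multi-character entries of a toy subword list.
lexicon = {"北京"}
tagged = to_subword_iob(["全", "北京市"], lexicon)
```

Here the word 北京市 is split into the subwords 北京 and 市, so the tagger is trained on three units instead of the four characters used by character-based tagging.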

‐セグメント化段階‐
第２段階では、図２を参照して、中国語の文８０が辞書を用いた単語セグメント化モジュール８６に与えられる。図４を参照して、仮説生成モジュール１４０が中国語辞書８２をセグメント化のためのレキシコンとして用いて、中国語の文８０の可能なセグメント化仮説全てを生成する。仮説は記憶部１４２に記憶される。尤度計算モジュール１４４は統計的言語モデル８４のＮグラム確率を用いて各仮説の尤度を計算する。最尤選択モジュール１４６が尤度の最も高い仮説を選択する。尤度計算モジュール１４４と最尤選択モジュール１４６との機能はソフトウェアで実現され、実際にはビーム探索法を用いて、最も尤度の高い仮説が、全ての仮説の尤度を計算することなく選択されることに注意されたい。 -Segmentation stage-
In the second stage, referring to FIG. 2, a Chinese sentence 80 is given to the dictionary-based word segmentation module 86. Referring to FIG. 4, the hypothesis generation module 140 generates all possible segmentation hypotheses for the Chinese sentence 80, using the Chinese dictionary 82 as the lexicon for segmentation. The hypotheses are stored in the storage unit 142. The likelihood calculation module 144 calculates the likelihood of each hypothesis using the N-gram probabilities of the statistical language model 84. The maximum likelihood selection module 146 selects the hypothesis with the highest likelihood. Note that the functions of the likelihood calculation module 144 and the maximum likelihood selection module 146 are implemented in software and that, in practice, a beam search is used so that the most likely hypothesis is selected without calculating the likelihood of every hypothesis.
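
A simplified sketch of dictionary-based maximum likelihood segmentation is given below. It deliberately swaps in a unigram model and exact dynamic programming in place of the N-gram model 84 and the beam search of the embodiment, so that the example stays short; the function name `segment` and the toy probabilities are assumptions.

```python
import math

def segment(sentence, unigram_logp, max_len=4):
    """Return the most likely segmentation of `sentence` under a unigram
    model whose keys double as the dictionary."""
    n = len(sentence)
    # best[end] = (best log-probability of sentence[:end], start of last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in unigram_logp and best[start][0] > -math.inf:
                score = best[start][0] + unigram_logp[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("sentence cannot be covered by the dictionary")
    # Follow the back-pointers to recover the word sequence.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]

# Toy dictionary with made-up probabilities.
probs = {"全": 0.1, "北": 0.05, "京": 0.05, "市": 0.1, "北京": 0.2, "北京市": 0.3}
logp = {w: math.log(p) for w, p in probs.items()}
result = segment("全北京市", logp)
```

Because every word must be a dictionary entry, a segmenter of this kind cannot propose new words, which is consistent with the high R-iv and low R-oov reported for the dictionary-based approach.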

選択された仮説はサブワードを用いたＩＯＢタグ付けモジュール８８に与えられる。ＩＯＢタグｔ_ｗが仮説中の単語の各々に割当てられる。 The selected hypothesis is given to the subword-based IOB tagging module 88. An IOB tag t_w is assigned to each word in the hypothesis.

図５を参照して、仮説生成モジュール１６０はサブワードリスト６４をセグメント化のためのレキシコンとして用いて可能な単語セグメント化仮説全てを生成する。仮説は記憶部１６２に記憶される。尤度計算モジュール１６４は確率モデル６６を用いて記憶部１６２に記憶された仮説の各々の尤度を計算する。最尤選択モジュール１６６は最も尤度の高い仮説を選択する。実際には尤度計算モジュール１６４と最尤選択モジュール１６６とはソフトウェアで実現され、ビーム探索法を用いるので、全ての仮説の尤度を計算する必要はないことに注意されたい。 Referring to FIG. 5, hypothesis generation module 160 generates all possible word segmentation hypotheses using subword list 64 as a lexicon for segmentation. The hypothesis is stored in the storage unit 162. The likelihood calculation module 164 calculates the likelihood of each hypothesis stored in the storage unit 162 using the probability model 66. Maximum likelihood selection module 166 selects the most likely hypothesis. It should be noted that the likelihood calculation module 164 and the maximum likelihood selection module 166 are actually implemented by software and use the beam search method, so it is not necessary to calculate the likelihood of all hypotheses.
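
The pruning idea behind the beam search can be illustrated with a minimal sketch. The scoring function `toy_score` is a hypothetical stand-in for the probability model 66, and the beam width of 3 is an arbitrary choice; neither is taken from the embodiment.

```python
def beam_search(subwords, score, tags=("B", "I", "O"), beam=3):
    """Tag subwords left to right, keeping only the `beam` best partial
    tag sequences after each position."""
    hyps = [(0.0, [])]  # (accumulated score, tags so far)
    for i in range(len(subwords)):
        expanded = [
            (s + score(subwords, i, prev, t), prev + [t])
            for s, prev in hyps
            for t in tags
        ]
        expanded.sort(key=lambda h: h[0], reverse=True)
        hyps = expanded[:beam]  # prune: low-scoring hypotheses are dropped
    return hyps[0]

def toy_score(subwords, i, prev_tags, tag):
    """Hypothetical log-score: reward a preferred tag per subword and
    penalize an "I" that does not continue a word."""
    preferred = {"全": "O", "北京": "B", "市": "I"}
    s = 0.0 if preferred.get(subwords[i]) == tag else -2.0
    if tag == "I" and (not prev_tags or prev_tags[-1] == "O"):
        s -= 10.0
    return s

best_score, best_tags = beam_search(["全", "北京", "市"], toy_score)
```

Only `beam` hypotheses survive each step, so the number of scored hypotheses grows linearly with sentence length rather than exponentially.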

サブワードを用いたＩＯＢタグ付けモジュール８８の動作の結果として、サブワードを用いたセグメント化が行われ、仮説が出力される。サブワードを用いたタグＴ_ｉｏｂと信頼性尺度ＣＭ_ｉｏｂ（ｔ_ｉｏｂ｜ｗ）とが仮説中の各単語に割当てられる。 As a result of the operation of the subword-based IOB tagging module 88, subword-based segmentation is performed and a hypothesis is output. A subword-based tag T_iob and a reliability measure CM_iob(t_iob|w) are assigned to each word in the hypothesis.

図６を参照して、比較モジュール１８０が仮説中の各単語のタグＴ_ｉｏｂとｔ_ｗとを比較する。比較モジュール１８０は比較結果を出力する。信頼性尺度計算モジュール１８４は、ＣＭ_ｉｏｂ（ｔ_ｉｏｂ｜ｗ）と、記憶部１８２内のαと、比較モジュール１８０の出力とを用いて式３により信頼性尺度ＣＭ（Ｔ_ｉｏｂ｜ｗ）を計算する。式３の結果得られる値がタグ選択モジュール１８６に与えられる。 Referring to FIG. 6, the comparison module 180 compares the tags T_iob and t_w of each word in the hypothesis and outputs the comparison result. The reliability measure calculation module 184 calculates the reliability measure CM(T_iob|w) according to Equation 3, using CM_iob(t_iob|w), α in the storage unit 182, and the output of the comparison module 180. The resulting value of Equation 3 is given to the tag selection module 186.

タグ選択モジュール１８６は信頼性尺度計算モジュール１８４から与えられた値を記憶部１８８に記憶されたしきい値Ｔ_ＴＨと比較する。もし式３の値がＴ_ＴＨより低い場合には、ＩＯＢタグＴ_ｉｏｂが拒絶され、辞書を用いたセグメント化タグｔ_ｗが使用される。そうでなければ、ＩＯＢタグ付けセグメント化タグＴ_ｉｏｂが使用される。 The tag selection module 186 compares the value given by the reliability measure calculation module 184 with the threshold T_TH stored in the storage unit 188. If the value of Equation 3 is lower than T_TH, the IOB tag T_iob is rejected and the dictionary-based segmentation tag t_w is used. Otherwise, the IOB tagging segmentation tag T_iob is used.
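
The selection rule above can be sketched as follows. The sketch assumes that Equation 3 is the weighted sum α·CM_iob(t_iob|w) + (1 − α)·δ(t_w, t_iob), with δ equal to 1 when the two tags agree and 0 otherwise, which matches the term-by-term description of Equation 3 given for the second embodiment. The function name `select_tag` is hypothetical, and the defaults are the values α = 0.7 and T_TH = 0.8 used in the experiments.

```python
def select_tag(t_iob, t_w, cm_iob, alpha=0.7, threshold=0.8):
    """Return the chosen tag and the combined reliability measure.

    t_iob: tag from subword-based IOB tagging
    t_w: tag from dictionary-based segmentation
    cm_iob: confidence of t_iob reported by the tagger, in [0, 1]
    """
    delta = 1.0 if t_iob == t_w else 0.0             # comparison module 180
    cm = alpha * cm_iob + (1.0 - alpha) * delta      # assumed form of Equation 3
    chosen = t_iob if cm >= threshold else t_w       # tag selection module 186
    return chosen, cm

agree_tag, agree_cm = select_tag("B", "B", 0.9)      # tags agree
reject_tag, reject_cm = select_tag("B", "O", 0.9)    # tags differ, IOB tag rejected
```

Setting `threshold=0.0` always keeps the IOB tag, while `threshold=1.0` falls back to the dictionary tag whenever the two disagree, which reproduces the two extreme cases of T_TH.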

＜実験＞
Ｓｉｇｈａｎコンテスト２００５によって与えられたデータを用いて、前のセクションで説明したこの発明の方策をテストした。データは、異なるソースからの４つのコーパスを含む。すなわち、アカデミアシニカ（ＡＳ）、香港市立大学（ＣＩＴＹＵ）、北京大学（ＰＫＵ）及びマイクロソフトリサーチ北京（ＭＳＲ：「マイクロソフト」は登録商標）である。この課題は提案されたサブワードを用いたＩＯＢタグ付けの評価を行うことが目的であるので、クローズドテストのみを行った。セグメント化の結果評価には５つの測定指標を用いた。すなわち、再現率（ｒｅｃａｌｌ：Ｒ）、精度（ｐｒｅｃｉｓｉｏｎ：Ｐ）、Ｆ−スコア（Ｆ）、ＯＯＶ率（Ｒ−ｏｏｖ）、及びＩＶ率（Ｒ−ｉｖ）である。コーパスとこれらのスコアの詳細については非特許文献１を参照されたい。 <Experiment>
The approach of the invention described in the previous sections was tested on the data provided by the Sighan contest 2005. The data comprise four corpora from different sources: Academia Sinica (AS), City University of Hong Kong (CITYU), Peking University (PKU), and Microsoft Research Beijing (MSR; “Microsoft” is a registered trademark). Because the purpose of this task was to evaluate the proposed subword-based IOB tagging, only the closed test was carried out. Five metrics were used to evaluate the segmentation results: recall (R), precision (P), F-score (F), OOV rate (R-oov), and IV rate (R-iv). See Non-Patent Document 1 for details of the corpora and these scores.
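
For reference, the five scores can be computed along the following lines. This sketch assumes that R-oov and R-iv denote recall restricted to gold-standard words that are outside and inside the training vocabulary, respectively; the helper names and the toy data are illustrative only, and the official scoring is described in Non-Patent Document 1.

```python
def spans(words):
    """Convert a word list into (start, end) character offsets."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def evaluate(gold, pred, vocab):
    """Word-level recall, precision, F-score, R-oov and R-iv for one sentence."""
    pred_spans = set(spans(pred))
    # A gold word is correct when the prediction contains the same span.
    correct = [w for w, s in zip(gold, spans(gold)) if s in pred_spans]
    recall = len(correct) / len(gold)
    precision = len(correct) / len(pred)
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    n_oov = sum(1 for w in gold if w not in vocab)
    n_iv = len(gold) - n_oov
    r_oov = sum(1 for w in correct if w not in vocab) / n_oov if n_oov else None
    r_iv = sum(1 for w in correct if w in vocab) / n_iv if n_iv else None
    return {"R": recall, "P": precision, "F": f_score, "R-oov": r_oov, "R-iv": r_iv}

gold = ["全", "北京市", "很", "大"]
pred = ["全", "北京", "市", "很", "大"]
scores = evaluate(gold, pred, vocab={"全", "很", "大", "北京"})
```

In this toy case the only OOV word, 北京市, is split incorrectly, so R-oov is 0 while R-iv is 1.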

辞書を用いた方策では、トレーニングデータから単語リストを語彙として抽出した。多義性解消のため、ＳＲＩＬＭツールキットを用いて、トライグラムＬＭが生成された。 In the policy using a dictionary, a word list was extracted from training data as a vocabulary. To eliminate ambiguity, a trigram LM was generated using the SRILM toolkit.

表１は、辞書を用いたセグメント化の性能を示す。テストデータにはいくつかの一文字単語があるが、トレーニングデータにはないので、この実験ではＲ−ｏｏｖ率はゼロではない。実際、ＯＯＶの認識はなかった。従って、この方策ではＦスコアが低くなった。しかしながら、Ｒ−ｉｖはきわめて高かった。 Table 1 shows the performance of dictionary-based segmentation. The R-oov rates in this experiment are not zero only because the test data contain some single-character words that do not appear in the training data. In effect, no OOV recognition was performed. As a result, this approach yielded low F-scores. R-iv, however, was very high.

‐文字を用いた、及びサブワードを用いたタグ付け部の効果‐
文字を用いるかサブワードを用いるかの主な違いは、再セグメント化のために用いられるレキシコンサブセットの内容である。文字を用いたタグ付けでは、辞書中の全ての中国語の文字を用いた。サブワードを用いたタグ付けでは、タグ付けのため、最も頻度の高い２０００個の複数文字単語をレキシコンに付加した。辞書を用いたセグメント化の結果はＦＭＭを用いて再セグメント化され、その後ＣＲＦによって「ＩＯＢ」タグでラベル付けされた。ＣＲＦタグ付けを用いたセグメント化の結果を表２に示す。ここで各スロットの上の数は文字を用いた方策で生成されたものであり、下の数はサブワードを用いたものである。提案されたサブワードを用いた方策はＣＩＴＹＵコーパスとＭＳＲコーパスでは有効であり、ＦスコアをＣＩＴＹＵでは０．９４１から０．９４６に、ＭＳＲでは０．９５９から０．９６４に向上させた。ＡＳ及びＰＫＵコーパスではＦスコアの変化はなかったが、再現率が向上した。表１と表２とを比較すると、ＣＲＦモデル化されたＩＯＢタグ付けが、辞書を用いた方策よりもよいセグメント化を生じさせたことが分かる。しかし、Ｒ−ｏｏｖ率が向上するにつれてＲ−ｉｖ率は悪化した。信頼性尺度の方策を用いてこの問題に取り組む。

-Effect of tagging using letters and subwords-
The main difference between the character-based and subword-based approaches is the content of the lexicon subset used for re-segmentation. For character-based tagging, all Chinese characters in the dictionary were used. For subword-based tagging, the 2000 most frequent multi-character words were added to the lexicon. The results of dictionary-based segmentation were re-segmented using FMM and then labeled with “IOB” tags by the CRF. The results of segmentation using CRF tagging are shown in Table 2, where the upper number in each slot was produced by the character-based approach and the lower number by the subword-based approach. The proposed subword-based approach was effective on the CITYU and MSR corpora, raising the F-score from 0.941 to 0.946 for CITYU and from 0.959 to 0.964 for MSR. On the AS and PKU corpora the F-score did not change, but recall improved. Comparing Table 1 with Table 2 shows that CRF-modeled IOB tagging produced better segmentation than the dictionary-based approach. However, the R-iv rate deteriorated as the R-oov rate improved. The reliability measure approach is used to address this problem.

‐信頼性尺度の効果‐
この応用では、辞書を用いたセグメント化の結果の組合せにより、ＩＯＢタグ付けの結果を再評価するために、信頼性尺度の方策を提案する。信頼性尺度の効果を表３に示す。ここで、α＝０．７を用い、信頼性しきい値Ｔ_ＴＨ＝０．８を用いた。各スロットにおいて、上の数は文字を用いた方策のものであり、下の数はサブワードを用いたものである。

-Effect of reliability scale-
In this application, a reliability measure approach is proposed for re-evaluating the results of IOB tagging by combining them with the results of dictionary-based segmentation. The effect of the reliability measure is shown in Table 3, where α = 0.7 and a reliability threshold of T_TH = 0.8 were used. In each slot, the upper number is for the character-based approach and the lower number is for the subword-based approach.

表３の結果は表２及び表１より良いことが分かり、信頼性尺度の方策を用いることで辞書を用いたセグメント化及びＩＯＢタグ付けの方策に対し最良の性能を達成できることが証明された。信頼性尺度の働きでＲ−ｉｖとＲ−ｏｏｖとのトレードオフがなされ、表１よりも高いＲ−ｏｏｖと表２より高いＲ−ｉｖとが得られた。しきい値Ｔ_ＴＨを変化させることで、最適なＲ−ｉｖとＲ−ｏｏｖとが得られる。

The results in Table 3 are better than those in Tables 2 and 1, demonstrating that the reliability measure approach achieves the best performance, ahead of both the dictionary-based segmentation and the IOB tagging approaches. The reliability measure trades off R-iv against R-oov, yielding a higher R-oov than in Table 1 and a higher R-iv than in Table 2. Optimal R-iv and R-oov can be obtained by changing the threshold T_TH.

信頼性尺度を用いた場合でも依然として、サブワードを用いたＩＯＢタグ付けが文字を用いたＩＯＢタグ付けより勝っていた。これは、提案されたサブワードを用いたＩＯＢタグ付けが非常に有効であることを示す。 Even with the reliability measure, subword-based IOB tagging still outperformed character-based IOB tagging. This indicates that the proposed subword-based IOB tagging is very effective.

‐検討及び関連の著作‐
この実施の形態で採用されたＩＯＢタグ付けの方策は、新規な思想ではない。これは非特許文献５でキュー及びシェンによって中国語の単語セグメント化で初めて用いられ、ここでは最大エントロピー法が用いられた。その後、この方策は非特許文献３でペン及びマッカラムによってＣＲＦに基づく方法で実現され、これはラベルのバイアスの問題を解決することができるため、最大エントロピー法より良好な結果を達成できることが分かった（非特許文献２を参照）。 -Review and related works-
The IOB tagging approach adopted in this embodiment is not a new idea. It was first applied to Chinese word segmentation by Xue and Shen in Non-Patent Document 5, where the maximum entropy method was used. Peng and McCallum later implemented the approach with a CRF-based method in Non-Patent Document 3, which was found to achieve better results than the maximum entropy method because it can solve the label bias problem (see Non-Patent Document 2).

我々が主に寄与するところは、ＩＯＢタグ付けの方策を、文字を用いたものからサブワードを用いたものに拡張することである。この新たな取組みにより、単語セグメント化が大いに向上することを証明した。我々の得た結果を、Ｆスコアについて、コンテスト２００５の最良の結果とともに表４に示す。 Our main contribution is to extend the IOB tagging strategy from using letters to using subwords. This new approach proved that word segmentation was greatly improved. Our results are shown in Table 4 along with the best results of the contest 2005 for F scores.

表４を参照して、ＣＩＴＹＵ、ＰＫＵ及びＭＳＲコーパスにおいて最高のＦスコアを達成した。この実施の形態のサブワードベースのタグ付けがこの良好な結果をもたらすのに重要な役割を果たしたものと思われる。これはクローズドテストであるため、アラビア数字、中国語の数字、アルファベット文字等のいくつかの情報は使用できない。このような情報を用いれば、表４より良好な結果が得られるであろう。例えば、アルファベットの文字が知られていれば、外国語の名前の一貫性のない誤り等を正すことができる。

Referring to Table 4, the highest F scores were achieved in the CITYU, PKU and MSR corpora. The subword-based tagging of this embodiment appears to have played an important role in providing this good result. Since this is a closed test, some information such as Arabic numerals, Chinese numerals, and alphabetic characters cannot be used. Using such information would give better results than Table 4. For example, if alphabetic characters are known, inconsistent errors in foreign language names can be corrected.

文字を用いたものにくらべサブワードを用いたＩＯＢタグ付けが有利な別の点は、その速度である。サブワードを用いた方策は、ラベル付けされる文字数より単語数のほうが少ないため、より高速である。トレーニングでもテストでも、速度が向上した。 Another advantage of subword-based IOB tagging over character-based tagging is speed. The subword-based approach is faster because there are fewer words to label than characters. Speed improved in both training and testing.

信頼性尺度を用いる思想は非特許文献３で見られ、ここではＯＯＶを認識するのに用いられた。発明の実施の形態では、これをよりきめ細かに用いる。信頼性尺度により、辞書を用いた結果とＩＯＢタグ付けを用いた結果とを組合せ、その結果、最適の性能を達成することができた。 The idea of using a reliability scale was found in Non-Patent Document 3, where it was used to recognize OOV. In the embodiment of the invention, this is used more finely. The reliability measure combined the results using the dictionary with the results using IOB tagging, so that optimal performance could be achieved.

この実施の形態では、中国語の単語セグメント化にサブワードを用いたＩＯＢタグ付けを提案した。ＣＲＦの方策を用いて、これが文字を用いた方法に勝ることを証明した。また、信頼度に依存する単語セグメント化を行うために、信頼性尺度を成功裏に用いた。この方策は、ユーザによるＲ−ｏｏｖとＲ−ｉｖとの要求に基づいて望ましいセグメント化を行うのに効果的である。 In this embodiment, IOB tagging using subwords for Chinese word segmentation was proposed. Using the CRF strategy, this proved to be superior to the method using letters. We have also successfully used the reliability measure to perform word segmentation depending on the reliability. This strategy is effective to achieve desirable segmentation based on user R-oov and R-iv requirements.

＜コンピュータによる実現＞
上述の実施の形態は、コンピュータシステムと、コンピュータシステム上で実行されるコンピュータプログラムとによって実現できる。図７はこの実施の形態で用いられるコンピュータシステム２５０の外観を示し、図８はコンピュータシステム２５０のブロック図である。ここで示されるコンピュータシステム２５０は単なる例示であって、他の構成が利用可能であることに注意されたい。 <Realization by computer>
The above-described embodiment can be realized by a computer system and a computer program executed on the computer system. FIG. 7 shows the external appearance of the computer system 250 used in this embodiment, and FIG. 8 is a block diagram of the computer system 250. It should be noted that the computer system 250 shown here is merely exemplary and other configurations are available.

図７を参照して、コンピュータシステム２５０はコンピュータ２６０と、全てコンピュータ２６０に接続された、モニタ２６２、キーボード２６６、マウス２６８、スピーカ５８及びマイクロフォン２９０とを含む。さらに、コンピュータ２６０はＤＶＤ−ＲＯＭ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃＲｅａｄＯｎｌｙＭｅｍｏｒｙ：デジタル多用途ディスク読出専用メモリ）ドライブ２７０と半導体メモリポート２７２とを含む。 Referring to FIG. 7, a computer system 250 includes a computer 260 and a monitor 262, a keyboard 266, a mouse 268, a speaker 58, and a microphone 290, all connected to the computer 260. Further, the computer 260 includes a DVD-ROM (Digital Versatile Disc Read Only Memory) drive 270 and a semiconductor memory port 272.

図８を参照して、コンピュータ２６０はさらに、ＤＶＤ−ＲＯＭドライブ２７０と半導体メモリポート２７２とに接続されたバス２８６と、全てバス２８６に接続された、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ：中央処理装置）２７６、コンピュータ２６０のブートアッププログラムを記憶するＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ：読出専用メモリ）２７８、ＣＰＵ２７６によって使用される作業領域を提供するとともにＣＰＵ２７６によって実行されるプログラムの記憶領域を提供するＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ：ランダムアクセスメモリ）２８０、言語モデル、サブワードリスト及び辞書（レキシコン）を記憶するハードディスクドライブ２７４、サウンドボード２８８、及びコンピュータ２６０にネットワーク５２との接続を提供するネットワークインターフェース（Ｉ／Ｆ）２９６とを含む。スピーカ５８とマイクロフォン２９０とはサウンドボード２８８に接続される。 Referring to FIG. 8, the computer 260 further includes a bus 286 connected to the DVD-ROM drive 270 and the semiconductor memory port 272, and the following components, all connected to the bus 286: a CPU (Central Processing Unit) 276; a ROM (Read Only Memory) 278 that stores the boot-up program of the computer 260; a RAM (Random Access Memory) 280 that provides a work area used by the CPU 276 and a storage area for programs executed by the CPU 276; a hard disk drive 274 that stores the language model, the subword list, and the dictionary (lexicon); a sound board 288; and a network interface (I/F) 296 that provides the computer 260 with a connection to the network 52. The speaker 58 and the microphone 290 are connected to the sound board 288.

上述の実施の形態のシステムを実現するソフトウェアは、ＤＶＤ−ＲＯＭ２８２又は半導体メモリ２８４等の記憶媒体上に記録されたオブジェクトコードの形で分配され、ＤＶＤ−ＲＯＭドライブ２７０又は半導体メモリポート２７２等の読出装置を介してコンピュータ２６０に提供され、ハードディスクドライブ２７４に記憶される。ＣＰＵ２７６がプログラムを実行する際に、プログラムはハードディスクドライブ２７４から読出され、ＲＡＭ２８０に記憶される。図示しないプログラムカウンタによって指定されたアドレスから命令がフェッチされ、命令が実行される。ＣＰＵは処理すべきデータをハードディスクドライブ２７４から読出し、処理の結果をこれもまたハードディスクドライブ２７４に記憶する。スピーカ５８及びマイクロフォン２９０は音声認識及び音声合成のために用いられる。 The software that realizes the system of the above-described embodiment is distributed in the form of object code recorded on a storage medium such as the DVD-ROM 282 or the semiconductor memory 284, provided to the computer 260 via a reading device such as the DVD-ROM drive 270 or the semiconductor memory port 272, and stored in the hard disk drive 274. When the CPU 276 executes the program, the program is read from the hard disk drive 274 and stored in the RAM 280. An instruction is fetched from the address designated by a program counter (not shown) and executed. The CPU reads the data to be processed from the hard disk drive 274 and stores the processing result in the hard disk drive 274 as well. The speaker 58 and the microphone 290 are used for speech recognition and speech synthesis.

コンピュータシステム２５０の一般的動作は周知であるので、ここではその詳細は説明しない。 The general operation of computer system 250 is well known and will not be described in detail here.

ソフトウェアの分配の仕方については、これは必ずしも記録媒体上に固定されていなくてもよい。例えば、ソフトウェアはネットワークを介して接続された別のコンピュータから分配されてもよい。ソフトウェアの一部はハードディスクドライブ２７４に記憶されてもよく、残りの部分がネットワークを介してハードディスクに入れられ実行の際に統合されてもよい。 As for the distribution method of the software, this does not necessarily have to be fixed on the recording medium. For example, the software may be distributed from another computer connected via a network. A part of the software may be stored in the hard disk drive 274, and the remaining part may be put into the hard disk via the network and integrated at the time of execution.

典型的には、現代のコンピュータはコンピュータのオペレーティングシステム（ＯＳ）によって提供される一般的な機能を利用し、所望の目的に応じて制御された様態で機能を実行する。従って、ＯＳによって又はサードパーティによって提供されうる一般的な機能を含まないプログラムであって単に一般的機能を実行する命令の組合せのみを指定するプログラムもまた、そのプログラムが全体として所望の目的を達成する制御構造を有する限り、この発明の範囲に含まれることは明らかである。 Typically, modern computers utilize general functions provided by a computer operating system (OS) and perform functions in a controlled manner according to the desired purpose. Therefore, a program that does not include a general function that can be provided by the OS or by a third party and that only specifies a combination of instructions that execute the general function also achieves the desired purpose as a whole. As long as it has a control structure, it is clearly included in the scope of the present invention.

＜第２の実施の形態＞
ＣＲＦを用いた上述の実施の形態では、所与の単語シーケンスＷ＝ｗ_０ｗ_１…ｗ_Ｍに対するＩＯＢタグシーケンス、Ｔ＝ｔ_０ｔ_１…ｔ_Ｍの確率は式１で定義される。しかし、この発明の信頼性尺度はＣＲＦによって計算されるものに限定されない。ＣＲＦに代えて、最大エントロピー（ＭａｘＥｎｔ）方策を用いることもできる。 <Second Embodiment>
In the above embodiment using CRF, the probability of an IOB tag sequence T = t_0 t_1 ... t_M for a given word sequence W = w_0 w_1 ... w_M is defined by Equation 1. However, the reliability measure of the present invention is not limited to one calculated by CRF. A maximum entropy (MaxEnt) approach can be used instead of CRF.

ＭａｘＥｎｔの方策によれば、ｔが現在の単語のタグＩ、Ｏ、Ｂであり、ｈが現在の単語の前後の単語とタグシーケンスを含む文脈であるものとして、確率ｐ（ｔ｜ｈ）の数学的表現は次のようになる。 In the MaxEnt approach, where t is the tag (I, O, or B) of the current word and h is the context containing the words and tag sequence before and after the current word, the probability p(t|h) is expressed mathematically as follows.
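
Assuming the standard maximum entropy form, which is consistent with the symbols f_i, λ_i, and Z defined for this expression, the probability can be written as follows. The explicit sum defining Z is an assumption based on the usual normalization over the three tags.

```latex
p(t \mid h) = \frac{1}{Z} \exp\Bigl( \sum_{i} \lambda_i f_i(h, t) \Bigr),
\qquad
Z = \sum_{t' \in \{I, O, B\}} \exp\Bigl( \sum_{i} \lambda_i f_i(h, t') \Bigr)
```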

ここでｆ_ｉはｉ番目の定義された特徴量が活性化されていれば１に等しく、そうでなければ０に等しい二値の特徴であり、Ｚは正規化系数であり、λ_ｉはｉ番目の特徴の重みである。

Here, f_i is a binary feature equal to 1 if the i-th defined feature is active and 0 otherwise, Z is the normalization coefficient, and λ_i is the weight of the i-th feature.

この実施の形態でもまた、信頼性尺度は２つのソースから来る。サブワードを用いたＩＯＢタグづけと辞書を用いた単語セグメント化である。その計算は以下のように定義される。 In this embodiment also, the reliability measure comes from two sources. IOB tagging using subwords and word segmentation using a dictionary. The calculation is defined as follows:
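
Based on the term-by-term description of this calculation (a weight α, a tagger confidence, and a Kronecker delta contributed by the dictionary-based segmentation), the combination is assumed here to take the following weighted-sum form. The exact notation of the original equation is not reproduced, so the symbol δ(t_w, t_iob) is an assumption.

```latex
CM(t_{iob} \mid w) = \alpha \, CM_{iob}(t_{iob} \mid w) + (1 - \alpha) \, \delta(t_w, t_{iob}),
\qquad
\delta(t_w, t_{iob}) =
\begin{cases}
1 & \text{if } t_w = t_{iob} \\
0 & \text{otherwise}
\end{cases}
```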

ここでｔ_ｉｏｂはＩＯＢタグ付けによって割当てられた単語ｗのＩＯＢタグであり、ｔ_ｗは辞書を用いたセグメント化の結果決定された先のＩＯＢタグであり、δ（ｔ_ｗ，ｔ_ｉｏｂ）は辞書を用いたセグメント化の寄与分であり、αはサブワードを用いたＩＯＢタグ付けと辞書を用いた単語セグメント化との重みである。δ（ｔ_ｗ，ｔ_ｉｏｂ）は上述のクロネッカーのデルタ関数である。この実施の形態では、実験的にαの値を０．８とした。

Here, t_iob is the IOB tag of the word w assigned by IOB tagging, t_w is the prior IOB tag determined by dictionary-based segmentation, δ(t_w, t_iob) is the contribution of the dictionary-based segmentation, and α is the weight balancing subword-based IOB tagging against dictionary-based word segmentation. δ(t_w, t_iob) is the Kronecker delta function described above, equal to 1 when t_w = t_iob and 0 otherwise. In this embodiment, the value of α was set experimentally to 0.8.

辞書を用いた単語セグメント化の後、単語はＦＭＭによりサブワードに再セグメント化され、その後ＩＯＢタグ付けに供される。各サブワードには、先行するＩＯＢタグ、ｔ_ｗ、ＣＭ_ｉｏｂ（ｔ｜ｗ）、以下で定義されるサブワードを用いたＩＯＢタグ付けの過程で導かれる信頼度確率 After dictionary-based word segmentation, the words are re-segmented into subwords by FMM and then subjected to IOB tagging. Each subword is given the prior IOB tag t_w and CM_iob(t|w), a confidence probability derived in the course of subword-based IOB tagging and defined as follows,

が与えられており、ここでｈ_ｉはビーム探索における仮説である。

where h_i is a hypothesis in the beam search.

上述のＣＭ（ｔ_ｉｏｂ｜ｗ）を用いたタグの選択は第１の実施の形態と同じである。 Tag selection using the above-described CM (t _iob | w) is the same as in the first embodiment.

タグ付けの正確さを改善するために、多種類の特徴量を定義することができる。しかし、コンテスト２００５のクローズドテストの制約に従うため、統語論的情報及び数とアルファベット文字のエンコード等は許されていない。従って、提供されたトレーニングコーパスから利用可能な特徴量のみを用いた。すなわち、文脈情報、接頭辞、接尾辞及び語長である。 In order to improve tagging accuracy, many types of features can be defined. However, in order to comply with the contest 2005 closed test constraints, syntactic information and encoding of numbers and alphabetic characters are not allowed. Therefore, only the feature quantities available from the provided training corpus were used. That is, context information, prefix, suffix, and word length.

‐文脈情報‐
ｗ_０，ｔ_−１，ｗ_０ｔ_−１，ｗ_０ｔ_−１ｗ_１，ｔ_−１ｗ_１，ｔ_−１ｔ_−２，ｗ_０ｔ_−１ｔ_−２，ｗ_０ｗ_１，ｗ_０ｗ_１ｗ_２，ｗ_−１，ｗ_０ｗ_−１，ｗ_０ｗ_−１ｗ₁,ｗ_-1ｗ₁,ｗ_-1ｗ_-2,ｗ₀ｗ_-1ｗ_-2,ｗ₁,ｗ₁ｗ₂
ここで、ｗは単語を表し、ｔはＩＯＢタグを表す。添え字は位置指標であり、０は現在の単語／タグを意味し、−１、−２は左側の１番目又は２番目の単語／タグを意味し、１、２は右側の１番目又は２番目の単語／タグを意味する。 -Context information-
w₀, t₋₁, w₀t₋₁, w₀t₋₁w₁, t₋₁w₁, t₋₁t₋₂, w₀t₋₁t₋₂, w₀w₁, w₀w₁w₂, w₋₁, w₀w₋₁, w₀w₋₁w₁, w₋₁w₁, w₋₁w₋₂, w₀w₋₁w₋₂, w₁, w₁w₂
Here, w represents a word and t represents an IOB tag. The subscript is a position index, 0 means the current word / tag, -1, -2 means the first or second word / tag on the left side, and 1, 2 are the first or 2 on the right side. Means the second word / tag.
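
The context templates listed above can be instantiated at a given position as sketched below. The feature string format, the `<PAD>` symbol for positions outside the sentence, and the function name `context_features` are assumptions made for illustration.

```python
# One tuple per template; "w" = word, "t" = tag, number = relative position.
TEMPLATES = [
    ("w0",), ("t-1",), ("w0", "t-1"), ("w0", "t-1", "w1"), ("t-1", "w1"),
    ("t-1", "t-2"), ("w0", "t-1", "t-2"), ("w0", "w1"), ("w0", "w1", "w2"),
    ("w-1",), ("w0", "w-1"), ("w0", "w-1", "w1"), ("w-1", "w1"),
    ("w-1", "w-2"), ("w0", "w-1", "w-2"), ("w1",), ("w1", "w2"),
]

def context_features(words, prev_tags, i):
    """Instantiate every template at position i.

    words: all words of the sentence
    prev_tags: tags already assigned to positions 0 .. i-1
    """
    def atom(name):
        seq = words if name[0] == "w" else prev_tags
        j = i + int(name[1:])
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    return ["|".join(t) + "=" + "|".join(atom(a) for a in t) for t in TEMPLATES]

feats = context_features(["全", "北京", "市"], ["O"], 1)
```

Only tags to the left of the current position appear in the templates, so they are always available when tagging proceeds from left to right.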

‐接頭辞及び接尾辞‐
これらは非常に有用な特徴量である。非特許文献４と同様の方策を用いて、接頭辞を示す「Ｂ」のタグが付された最も頻度の高い単語と、接尾辞を示す「Ｉ」のタグを付された最後の単語とを抽出した。接頭辞及び接尾辞を含む特徴は、以下の組合せで他の特徴と組合せて用いられ、ここでｐは接頭辞、ｓは接尾辞を表し、ｐ０は現在の単語が接頭辞であることを意味し、ｓ１は右側の１番目の単語が接尾辞であることを意味し、以下同様である。
ｐ₀，ｗ₀ｐ_−１，ｗ_０ｐ_１，ｓ_０，ｗ_０ｓ_−１，ｗ_０ｓ_１，ｐ_０ｗ_−１，ｐ_０ｗ_１，ｓ_０ｗ_−１，ｓ_０ｗ_−２
‐語長‐
これは単語の文字数として定義される。中国語の単語長は単語の組立てについて特徴的な役割を有する。例えば、単一文字の単語は複数文字の単語に比べて新規な単語を形成しやすい。語長を用いた特徴量を以下に列挙する。ここでＬ_０は現在の単語の語長を意味する。他のものも同様に推測できる。
Ｌ_０，ｗ_０Ｌ_−１，ｗ_０Ｌ_１，ｗ_０Ｌ_−１Ｌ_１，Ｌ_０Ｌ_−１，Ｌ_０Ｌ_１
特徴量の選択に関連して、単純にトレーニングデータ中の各特徴量の絶対カウントを測定指標として採用し、特徴量の種類の各々について切捨て値を定義した。 -Prefix and suffix-
These are very useful features. Using the same approach as Non-Patent Document 4, the most frequent words tagged “B”, which indicates a prefix, and the last words tagged “I”, which indicates a suffix, were extracted. Features involving prefixes and suffixes are used together with other features in the combinations below, where p denotes a prefix and s a suffix; p₀ means that the current word is a prefix, s₁ means that the first word to the right is a suffix, and so on.
p₀, w₀p₋₁, w₀p₁, s₀, w₀s₋₁, w₀s₁, p₀w₋₁, p₀w₁, s₀w₋₁, s₀w₋₂
-Word length-
This is defined as the number of characters in a word. In Chinese, word length plays a characteristic role in word formation. For example, single-character words are more likely than multi-character words to form new words. The features using word length are listed below, where L₀ denotes the length of the current word; the others can be inferred in the same way.
L₀, w₀L₋₁, w₀L₁, w₀L₋₁L₁, L₀L₋₁, L₀L₁
For feature selection, the absolute count of each feature in the training data was simply used as the metric, and a cutoff value was defined for each feature type.
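
Count-based feature selection of this kind might look as follows. The sketch assumes that a feature is kept when its count reaches the cutoff of its type; the cutoff values, the type names, and the function name `select_features` are hypothetical.

```python
from collections import Counter

def select_features(feature_events, cutoffs, default_cutoff=1):
    """Keep features whose absolute training count reaches the cutoff
    defined for their type.

    feature_events: iterable of (feature_type, feature_string) pairs,
                    one per occurrence in the training data
    cutoffs: mapping from feature type to minimum count
    """
    counts = Counter(feature_events)
    return {
        feat
        for feat, count in counts.items()
        if count >= cutoffs.get(feat[0], default_cutoff)
    }

events = (
    [("context", "w0=北京")] * 3
    + [("length", "L0=2")] * 1
    + [("length", "L0=1")] * 2
)
kept = select_features(events, cutoffs={"context": 2, "length": 2})
```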

最大エントロピーモデルをトレーニングするために、ＩＩＳ（ＩｍｐｒｏｖｅｄＩｔｅｒａｔｉｖｅＳｃａｌｉｎｇａｌｇｏｒｉｔｈｍ：改良反復スケーリングアルゴリズム）を用いた。詳細は、ラファルティら、２００１年の非特許文献２を参照されたい。 The Improved Iterative Scaling (IIS) algorithm was used to train the maximum entropy model. For details, see Lafferty et al., 2001 (Non-Patent Document 2).

タグ付けアルゴリズムはビーム探索法（非特許文献６）に基づく。ＩＯＢタグ付けの後、各単語にＢ／Ｉ／Ｏのタグが付される。単語セグメント化がすぐに得られる。 The tagging algorithm is based on the beam search method (Non-Patent Document 6). After IOB tagging, each word is tagged with B / I / O. Word segmentation is readily available.
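
Recovering the word sequence from the tagged subwords is a simple merge, sketched below. The function name `tags_to_words` is hypothetical, and a stray leading "I" is treated as starting a new word, which is an assumption about error handling.

```python
def tags_to_words(subwords, tags):
    """Merge tagged subwords into words: "B" and "O" start a new word,
    "I" is appended to the word in progress."""
    words = []
    for subword, tag in zip(subwords, tags):
        if tag == "I" and words:
            words[-1] += subword
        else:
            words.append(subword)
    return words

words = tags_to_words(["全", "北京", "市"], ["O", "B", "I"])
```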

３種類の実験を行った。第１の実験では、サブワードリストに全ての中国語の文字を含めた。第２の実験では、サブワードリストにさらに２５００個のトレーニングコーパス中最も頻度の高い複数文字単語を含めた。第３の実験では、別の２５００個の最も頻度の高い複数文字単語を含めた。ここでα＝０．８、しきい値Ｔ_ＴＨ＝０．７を用いた。 Three experiments were conducted. In the first experiment, the subword list contained all Chinese characters. In the second experiment, the 2500 most frequent multi-character words in the training corpus were added to the subword list. In the third experiment, another 2500 of the most frequent multi-character words were added. Here, α = 0.8 and the threshold T_TH = 0.7 were used.

表５、６、７はコンテスト２００５のクローズドテストでの辞書を用いたセグメント化の結果、純粋にサブワードを用いたＩＯＢタグ付けの結果、及び第２の実施の形態に従った信頼性尺度を用いたセグメント化の結果を示す。これらの表において、ＡＳ、ＣＩＴＹＵ、ＰＫＵ及びＭＳＲはそれぞれ、アカデミアシニカコーパス、香港市立大学コーパス、北京大学コーパス及びマイクロソフトリサーチコーパスである（「マイクロソフト」は登録商標）。これらの表において、第１、第２、第３の実験結果を各スロット内の上、中、下の行で示す。 Tables 5, 6, and 7 use the results of segmentation using the dictionary in the contest 2005 closed test, the results of IOB tagging using purely subwords, and the reliability measure according to the second embodiment. Shows the result of the segmentation. In these tables, AS, CITYU, PKU and MSR are respectively the Academia Sinica Corpus, Hong Kong City University Corpus, Peking University Corpus and Microsoft Research Corpus ("Microsoft" is a registered trademark). In these tables, the results of the first, second, and third experiments are shown in the upper, middle, and lower rows in each slot.

結果からわかるように、サブワードを用いた方策は文字を用いたＩＯＢタグ付けに勝った。信頼性尺度を用いた場合でも、文字を用いたＩＯＢタグ付けに勝り、提案されたサブワードを用いたＩＯＢタグ付けが極めて有効であることが分かった。信頼性尺度のもとでの改善は減少したが、依然としてかなりのものである。

As the results show, the subword-based approach outperformed character-based IOB tagging. It did so even when the reliability measure was used, which shows that the proposed subword-based IOB tagging is highly effective. The improvement was smaller under the reliability measure, but it remained considerable.

この実施の形態でも、しきい値Ｔ_ＴＨを変化させることによって、Ｒ−ｉｖとＲ−ｏｏｖとを最適化することができる。 Also in this embodiment, R-iv and R-oov can be optimized by changing the threshold value T _TH .

今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味および範囲内でのすべての変更を含む。 The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

この発明の一実施の形態に従った中国語の単語セグメント化の全体処理を示す図である。 FIG. 1 is a diagram showing the overall process of Chinese word segmentation according to one embodiment of the invention.
この発明の実施の形態の中国語単語セグメント化装置５０を示すブロック図である。 FIG. 2 is a block diagram showing the Chinese word segmentation device 50 of the embodiment of the invention.
図２のモデルトレーニングモジュール６２の機能的ブロック図である。 FIG. 3 is a functional block diagram of the model training module 62 of FIG. 2.
図２に示す辞書を用いたセグメント化モジュール８６の機能的ブロック図である。 FIG. 4 is a functional block diagram of the dictionary-based segmentation module 86 shown in FIG. 2.
図２に示すサブワードを用いたＩＯＢタグ付けモジュール８８の機能的ブロック図である。 FIG. 5 is a functional block diagram of the subword-based IOB tagging module 88 shown in FIG. 2.
図２に示す信頼性尺度を用いたセグメント化モジュール９０の機能的ブロック図である。 FIG. 6 is a functional block diagram of the segmentation module 90 using the reliability measure shown in FIG. 2.
この発明の実施の形態を実現するコンピュータシステム２５０の正面図である。 FIG. 7 is a front view of the computer system 250 that realizes the embodiment of the invention.
コンピュータシステム２５０のハードウェアブロック図である。 FIG. 8 is a hardware block diagram of the computer system 250.

Explanation of symbols

３２辞書を用いたセグメント化
３４ＩＯＢタグ付け
３６信頼度を用いた単語のセグメント化
５０中国語単語セグメント化装置
６０トレーニングデータ
６２モデルトレーニングモジュール
６４サブワードリスト
６６確率モデル
８６辞書を用いた単語セグメント化モジュール
８８サブワードを用いたＩＯＢタグ付けモジュール
９０信頼性尺度を用いたセグメント化モジュール
９２セグメント化結果
１６０仮説生成モジュール
１６４尤度計算モジュール
１６６最尤選択モジュール
１８０比較モジュール
１８４信頼性尺度計算モジュール
１８６タグ選択モジュール

32 Segmentation using a dictionary
34 IOB tagging
36 Word segmentation using reliability
50 Chinese word segmentation device
60 Training data
62 Model training module
64 Subword list
66 Probability model
86 Word segmentation module using a dictionary
88 IOB tagging module using subwords
90 Segmentation module using the reliability measure
92 Segmentation result
160 Hypothesis generation module
164 Likelihood calculation module
166 Maximum likelihood selection module
180 Comparison module
184 Reliability measure calculation module
186 Tag selection module

Claims

An apparatus for segmenting a Chinese character sequence into a Chinese word sequence, comprising:
a first storage unit for storing a Chinese subword list enumerating Chinese characters and words each consisting of a plurality of Chinese characters;
a second storage unit for storing a statistical probability model of sequences of first tags assigned to Chinese subwords, each first tag indicating whether the subword is an independent word, the first subword of a multi-character word, or otherwise; and
subword-based segmenting means for segmenting a Chinese character sequence into a first Chinese word sequence by maximum likelihood estimation using the subword list and the statistical probability model, wherein the multi-subword words in the first Chinese word sequence are segmented into subwords each labeled with the first tag according to the segmentation, and the words in the Chinese subword list are treated as subwords by the subword-based segmenting means when segmenting the Chinese character sequence.

An apparatus for segmenting a Chinese character sequence into a Chinese word sequence, wherein
the subword-based segmenting means outputs a predefined reliability probability for each of the first tags, the apparatus further comprising:
a third storage unit for storing a Chinese dictionary enumerating Chinese characters and Chinese words;
a fourth storage unit for storing a statistical language model of Chinese;
dictionary-based word segmenting means for segmenting the input Chinese character sequence into a second Chinese word sequence by maximum likelihood estimation using the dictionary and the language model, and for attaching a second tag to each character in the second Chinese word sequence, the second tag indicating whether the character functions as an independent word, is the first character of a multi-character word, or otherwise; and
determining means for determining the tag to be assigned to each subword in the first Chinese word sequence as a function of the first and second tags both assigned to that subword and the reliability probability of that subword.

The apparatus according to claim 2, wherein the subword-based segmenting means outputs a predefined reliability probability CM_iob(t|w_i) for each first tag t of the i-th word w_i according to the following equation:

where the numerator is the sum over all observation sequences in which the word w_i is labeled t, W is the segmented Chinese word sequence, and T = t_0 t_1 ... t_M is the tag sequence assigned to the word sequence W.

The apparatus according to claim 2, wherein the subword-based segmenting means outputs a predefined reliability probability CM_iob(t|w) for each first tag t of the word w according to the following equation:

where h_i is the i-th hypothesis of segmentation.