JPWO2015166606A1

JPWO2015166606A1 - Natural language processing system, natural language processing method, and natural language processing program

Info

Publication number: JPWO2015166606A1
Application number: JP2015512822A
Authority: JP
Inventors: 正人萩原
Original assignee: Rakuten Inc
Current assignee: Rakuten Group Inc
Priority date: 2014-04-29
Filing date: 2014-12-08
Publication date: 2017-04-20
Anticipated expiration: 2034-12-08
Also published as: TWI567569B; JP5809381B1; KR101729461B1; TW201544976A; WO2015166606A1; KR20160124237A; CN106030568A; CN106030568B

Abstract

A natural language processing system according to one embodiment includes an analysis unit and a correction unit. The analysis unit performs morphological analysis on one sentence using a division model, thereby setting a tag on each divided element obtained by dividing the sentence. The division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence. It then raises the scores of features related to the correct tag corresponding to an incorrect tag and lowers the scores of features related to that incorrect tag, thereby correcting the division model that the analysis unit uses in the morphological analysis of the next sentence.

Description

One aspect of the present invention relates to a natural language processing system, a natural language processing method, and a natural language processing program.

Morphological analysis, which divides a sentence into a sequence of morphemes and determines the part of speech of each morpheme, is known as one of the basic techniques of natural language processing. In this connection, Patent Document 1 below describes a morphological analysis device that decomposes input text data into morphemes, refers to a morpheme dictionary to acquire position information corresponding to the decomposed morphemes, and determines a morpheme sequence from the candidate morpheme sequences obtained by the decomposition, using a cost function based on the position information.

Patent Document 1: JP 2013-210856 A

Morphological analysis is executed using a division model that includes a score for each feature. The division model, which can be regarded as the knowledge used for morphological analysis, is generally fixed in advance. Consequently, it is very difficult to obtain correct results when analyzing a sentence that belongs to a new field not covered by the division model, or a sentence that has new properties. On the other hand, if the division model is corrected by a technique such as machine learning, the time required for the correction may grow unpredictably. It is therefore desirable to correct the division model for morphological analysis automatically within a fixed time.

A natural language processing system according to one aspect of the present invention includes an analysis unit and a correction unit. The analysis unit executes morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting a tag indicating at least the part of speech of a word on each divided element obtained by dividing the sentence. The division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, raises the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag above their current values, and lowers the output feature scores and transition feature scores related to that incorrect tag below their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.

A natural language processing method according to one aspect of the present invention is executed by a natural language processing system including a processor, and includes an analysis step and a correction step. The analysis step executes morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting a tag indicating at least the part of speech of a word on each divided element obtained by dividing the sentence. The division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction step compares the tags indicated by the analysis result obtained in the analysis step with correct data indicating the correct tags of the sentence, raises the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag above their current values, and lowers the output feature scores and transition feature scores related to that incorrect tag below their current values, thereby correcting the division model used in the morphological analysis of the next sentence in the analysis step.

A natural language processing program according to one aspect of the present invention causes a computer to function as an analysis unit and a correction unit. The analysis unit executes morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting a tag indicating at least the part of speech of a word on each divided element obtained by dividing the sentence. The division model includes scores of output features, each indicating a correspondence between a divided element and a tag, and scores of transition features, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence, raises the output feature scores and transition feature scores related to the correct tag corresponding to an incorrect tag above their current values, and lowers the output feature scores and transition feature scores related to that incorrect tag below their current values, thereby correcting the division model used by the analysis unit in the morphological analysis of the next sentence.

In these aspects, each time one sentence is morphologically analyzed, the analysis result is compared with the correct data, and the division model is corrected based on the differences between them. Because the division model is corrected sentence by sentence, the time required to correct it when a plurality of sentences are processed grows only roughly linearly with the number of sentences. The division model for morphological analysis can therefore be corrected automatically within a fixed time (in other words, within a predictable range of time).

According to one aspect of the present invention, the division model for morphological analysis can be corrected automatically within a fixed time.

FIG. 1 is a conceptual diagram of processing in the natural language processing system according to the embodiment.
FIG. 2 is a diagram showing an example of morphological analysis in the embodiment.
FIG. 3 is a diagram showing the hardware configuration of a computer constituting the natural language processing system according to the embodiment.
FIG. 4 is a block diagram showing the functional configuration of the natural language processing system according to the embodiment.
FIG. 5 is a diagram conceptually showing an example of tagging.
FIGS. 6(a) and 6(b) are diagrams each schematically showing an example of a score update.
FIG. 7 is a flowchart showing the operation of the natural language processing system according to the embodiment.
FIG. 8 is a diagram showing the configuration of the natural language processing program according to the embodiment.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are given the same reference numerals, and duplicate description is omitted.

First, the functions and configuration of the natural language processing system 10 according to the embodiment are described with reference to FIGS. 1 to 5. The natural language processing system 10 is a computer system that executes morphological analysis. Morphological analysis is processing that divides a sentence into a sequence of morphemes and determines the part of speech of each morpheme. A sentence is a unit of linguistic expression that represents one complete statement, and it is expressed as a character string. A morpheme is the smallest linguistic unit that has meaning. A sequence of morphemes is the ordered series of one or more morphemes obtained by dividing a sentence into those morphemes. A part of speech is a classification of words by grammatical function or form.

The natural language processing system 10 morphologically analyzes individual sentences using a division model 20. One characteristic of the natural language processing system 10 is that, while the division model 20 is being learned, the system corrects the division model 20 each time it morphologically analyzes a sentence. Once correction of the division model 20 is finished, the natural language processing system 10 with the finalized division model 20 is provided to users. A user can have the natural language processing system 10 execute morphological analysis, and in that case the analysis is executed without any correction of the division model 20. In this specification, a "division model" is the criterion (clue) used when dividing a sentence into one or more morphemes, and it is expressed by the score of each feature. The division model is obtained by machine learning using one or more pieces of training data. Training data is data that indicates at least a sentence divided into words and the part of speech of each word obtained by dividing the sentence. A feature is a clue for obtaining a correct result in morphological analysis. In general, there is no restriction on what is used as a feature (clue). The score of a feature is a numerical value representing the likelihood of that feature.

FIG. 1 briefly illustrates the concept of processing in the natural language processing system 10 according to this embodiment. The gear M in FIG. 1 represents execution of morphological analysis. At a certain point in time, the natural language processing system 10 divides a sentence s_1 into one or more morphemes by executing morphological analysis using a division model w_1. In this embodiment, the natural language processing system 10 divides a sentence into individual characters and executes character-by-character processing, thereby dividing the sentence into one or more morphemes. That is, in this embodiment, the divided element to be processed is a character. The natural language processing system 10 represents the result of morphological analysis by setting a tag on each character (divided element). In this specification, a "tag" is a label indicating an attribute or function of a character. Tags are described in more detail later.

After executing the morphological analysis, the natural language processing system 10 receives data indicating the correct answer of the morphological analysis of the sentence s_1 (correct data), compares the analysis result with the correct data, and corrects the division model w_1 to obtain a new division model w_2. Specifically, if at least part of the tagging in the morphological analysis of the sentence s_1 is wrong, the natural language processing system 10 judges the analysis result as a whole to be wrong. The system then evaluates the feature corresponding to each tag in the correct data as "correct (+1)" and raises the score of that feature above its current value, and evaluates the feature corresponding to each tag in the analysis result as "wrong (-1)" and lowers the score of that feature below its current value, thereby obtaining the division model w_2. If some tags in the analysis result are correct, the two evaluations "correct (+1)" and "wrong (-1)" of the features related to those tags (the correct tags) cancel out. The process of raising or lowering feature scores described above is therefore equivalent to raising the scores of features related to the correct tag corresponding to an incorrect tag (the correct tag for the incorrect portion) and lowering the scores of features related to that incorrect tag (the tag of the incorrect portion).

Alternatively, the natural language processing system 10 may evaluate each tag in the correct data as "correct (+1)" and the tag for each character in the analysis result as "wrong (-1)", cancel out the two evaluations for each tag, and then raise the scores of features corresponding to tags evaluated as "correct (+1)" and lower the scores of features corresponding to tags evaluated as "wrong (-1)".

For example, suppose that the sentence s_1 contains five characters x_a, x_b, x_c, x_d, and x_e. Suppose further that the correct tags of the characters x_a, x_b, x_c, x_d, and x_e are t_a, t_b, t_c, t_d, and t_e, respectively, and that the morphological analysis assigned the tags t_a, t_g, t_h, t_d, and t_e to the characters. In this case, the natural language processing system 10 evaluates the features corresponding to the tags t_a, t_b, t_c, t_d, and t_e in the correct data as "correct (+1)" and raises the scores of those features above their current values, and evaluates the features corresponding to the tags t_a, t_g, t_h, t_d, and t_e in the analysis result as "wrong (-1)" and lowers the scores of those features below their current values. As a result, the scores of the features corresponding to the tags t_a, t_d, and t_e are unchanged from before the update, the scores of the features corresponding to the correct tags t_b and t_c become higher, and the scores of the features corresponding to the incorrect tags t_g and t_h become lower.
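The five-character example above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: it tracks unigram output features only, and the function name `update_scores`, the `(tag, character)` dictionary keys, and the step size of 1.0 are assumptions introduced here.

```python
from collections import defaultdict


def update_scores(scores, chars, gold_tags, predicted_tags, step=1.0):
    """Apply the +1 / -1 evaluation to (tag, character) output features.

    `scores` maps a hypothetical (tag, character) key to a score.
    """
    if gold_tags == predicted_tags:
        return scores  # the analysis result is fully correct: nothing to do
    # Features in the correct data are "correct (+1)";
    # features in the analysis result are "wrong (-1)".
    for tags, delta in ((gold_tags, +step), (predicted_tags, -step)):
        for ch, tag in zip(chars, tags):
            scores[(tag, ch)] += delta
    return scores


chars = ["x_a", "x_b", "x_c", "x_d", "x_e"]
gold = ["t_a", "t_b", "t_c", "t_d", "t_e"]
pred = ["t_a", "t_g", "t_h", "t_d", "t_e"]
scores = update_scores(defaultdict(float), chars, gold, pred)
```

For the tags that were already correct (t_a, t_d, t_e), the two evaluations cancel and the score stays at its previous value; only the features at the two mismatched positions change.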

When executing morphological analysis on the next sentence s_2, the natural language processing system 10 uses the division model w_2. The system then receives the correct data for the morphological analysis of the sentence s_2, compares the analysis result with the correct data, and corrects the division model w_2 in the same way as it corrected the division model w_1, thereby obtaining a new division model w_3.

In this way, the natural language processing system 10 corrects the division model each time it processes one sentence (s_1, s_2, ..., s_t), that is, w_1 → w_2, w_2 → w_3, ..., w_t → w_{t+1}, and uses the corrected division model in the morphological analysis of the next sentence. This technique of updating the model each time one piece of training data is processed is also called "online learning" or "online machine learning".
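The sentence-by-sentence loop w_1 → w_2 → ... can be sketched as below. This is a simplified sketch under stated assumptions: the analyzer here picks the best tag for each character independently from unigram output features and ignores transition features, and the names `analyze` and `train_online` and the two-tag tagset are hypothetical.

```python
from collections import defaultdict


def analyze(scores, chars, tagset):
    """Pick, for each character, the tag with the highest output feature score."""
    return [max(tagset, key=lambda t: scores[(t, ch)]) for ch in chars]


def train_online(sentences, tagset):
    """Correct the model once per sentence and reuse it for the next sentence."""
    scores = defaultdict(float)  # plays the role of w_1
    for chars, gold in sentences:
        pred = analyze(scores, chars, tagset)  # analysis with the current model
        if pred != gold:
            for ch, g, p in zip(chars, gold, pred):
                scores[(g, ch)] += 1.0  # correct (+1)
                scores[(p, ch)] -= 1.0  # wrong (-1)
        # `scores` now plays the role of w_{t+1}
    return scores


tagset = ["N", "P"]
training = [(["本", "を"], ["N", "P"])] * 2
model = train_online(training, tagset)
```

Each sentence triggers exactly one analysis and at most one correction, so the total work grows in proportion to the number of sentences.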

FIG. 2 shows an example of the result of morphological analysis by the natural language processing system 10. In this example, the natural language processing system 10 divides the Japanese sentence "本を買って (hon wo katte)", which corresponds to the English sentence "I bought a book", into five characters x_1: "本 (hon)", x_2: "を (wo)", x_3: "買 (ka)", x_4: "っ (t)", and x_5: "て (te)". The natural language processing system 10 then sets a tag on each character by executing morphological analysis. In this embodiment, a tag is a combination of the appearance mode of the character within a word, the part of speech of that word, and the subclass of that part of speech, and it is written using letters of the alphabet, as in "S-N-nc".

The appearance mode is information indicating whether a character forms a word by itself or forms a word in combination with other characters and, when the character is part of a word consisting of two or more characters, where in the word the character is located. In this embodiment, the appearance mode is one of S, B, I, and E. The appearance mode "S" indicates that the character forms a word by itself. The appearance mode "B" indicates that the character is at the beginning of a word consisting of two or more characters. The appearance mode "I" indicates that the character is in the middle of a word consisting of three or more characters. The appearance mode "E" indicates that the character is at the end of a word consisting of two or more characters. The example in FIG. 2 shows that the characters x_1, x_2, and x_5 are each a word by themselves and that the characters x_3 and x_4 together form one word.
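As an illustration of the S/B/I/E definitions above, the following sketch derives the appearance mode of every character from a sentence that has already been split into words. The function name `appearance_modes` is hypothetical and is not taken from the embodiment.

```python
def appearance_modes(words):
    """Return the appearance mode (S, B, I, or E) of each character."""
    modes = []
    for word in words:
        if len(word) == 1:
            modes.append("S")  # the character is a word by itself
        else:
            # first character, any middle characters, last character
            modes.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return modes


# The sentence of FIG. 2, split into the words described in the text.
fig2_modes = appearance_modes(["本", "を", "買っ", "て"])
```

For the sentence of FIG. 2 this yields S, S, B, E, S, which agrees with the description that x_3 and x_4 form one word.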

The scheme for the appearance mode is not limited. This embodiment uses the scheme called "SBIEO", but a scheme such as "IOB2", which is well known to those skilled in the art, may be used instead.

Examples of parts of speech include nouns, verbs, particles, adjectives, adjectival nouns, and conjunctions. In this embodiment, a noun is represented by "N", a particle by "P", and a verb by "V". The example in FIG. 2 shows that the character x_1 is a noun, the character x_2 is a particle, the word consisting of the characters x_3 and x_4 is a verb, and the character x_5 is a particle.

The subclass of a part of speech indicates a subordinate concept of the corresponding part of speech. For example, nouns can be further classified into common nouns and proper nouns, and particles can be further classified into case particles, conjunctive particles, binding particles, and so on. In this embodiment, a common noun is represented by "nc", a proper noun by "np", a case particle by "k", a conjunctive particle by "sj", and a general verb by "c". The example in FIG. 2 shows that the character x_1 is a common noun, the character x_2 is a case particle, the word consisting of the characters x_3 and x_4 is a general verb, and the character x_5 is a conjunctive particle.
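A tag such as "S-N-nc" therefore has three parts. The sketch below shows one possible way to represent it; the class name `Tag` and its field names are assumptions made for illustration, and the five tags listed are the ones implied by the description of FIG. 2 (appearance mode, part of speech, and subclass of each character).

```python
from typing import NamedTuple


class Tag(NamedTuple):
    mode: str      # appearance mode: S, B, I, or E
    pos: str       # part of speech: N (noun), P (particle), V (verb), ...
    subclass: str  # subclass: nc, np, k, sj, c, ...

    @classmethod
    def parse(cls, text):
        """Split a string such as 'S-N-nc' into its three parts."""
        mode, pos, subclass = text.split("-")
        return cls(mode, pos, subclass)

    def __str__(self):
        return "-".join(self)


# Tags for x_1 .. x_5 of "本を買って" as described for FIG. 2.
fig2_tags = [Tag.parse(t) for t in ["S-N-nc", "S-P-k", "B-V-c", "E-V-c", "S-P-sj"]]
```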

The feature scores stored in the division model 20 are the scores of output features (emission features) and the scores of transition features.

An output feature is a clue indicating the correspondence between a tag and a character or character type. In other words, an output feature is a clue indicating which characters or character types are likely to correspond to which tags. The output feature corresponds to a feature representation of the output (emission) matrix of a hidden Markov model. This embodiment uses unigram output features (a unigram being a character string of exactly one character) and bigram output features (a bigram being a character string of two consecutive characters).

Here, a character type is a kind of character in a given language. Japanese character types include, for example, kanji, hiragana, katakana, the alphabet (uppercase and lowercase), Arabic numerals, kanji numerals, and the middle dot (・). In this embodiment, character types are represented by letters of the alphabet. For example, "C" indicates kanji, "H" indicates hiragana, "K" indicates katakana, "L" indicates the alphabet, and "A" indicates Arabic numerals. The example in FIG. 2 shows that the characters x_1 and x_3 are kanji and that the characters x_2, x_4, and x_5 are hiragana.
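One way to assign these character-type letters is to test Unicode code point ranges, as in the sketch below. The ranges, the function name `char_type`, and the fallback letter "O" are assumptions made for this example; the embodiment does not say how character types are detected. The sketch also does not separate kanji numerals from other kanji, and it leaves the middle dot (U+30FB) outside the katakana range because the text lists it as its own character type.

```python
def char_type(ch):
    """Return a character-type letter for a single character."""
    code = ord(ch)
    if 0x4E00 <= code <= 0x9FFF:
        return "C"  # kanji (CJK Unified Ideographs block)
    if 0x3041 <= code <= 0x3096:
        return "H"  # hiragana letters
    if 0x30A1 <= code <= 0x30FA:
        return "K"  # katakana letters
    if ch.isascii() and ch.isalpha():
        return "L"  # alphabet, uppercase or lowercase
    if ch.isascii() and ch.isdigit():
        return "A"  # Arabic numerals
    return "O"      # hypothetical catch-all, not defined in the text


fig2_types = [char_type(ch) for ch in "本を買って"]
```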

The unigram output feature for a character is a clue indicating the correspondence between a tag t and a character x. The unigram output feature for a character type is a clue indicating the correspondence between a tag t and a character type c. In this embodiment, the likelihood score s of the correspondence between the tag t and the character x is written {t/x, s}, and the likelihood score s of the correspondence between the tag t and the character type c is written {t/c, s}. The division model 20 includes scores for a plurality of tags for each character or character type. If data for every kind of tag is prepared for one character or character type, the division model 20 also includes scores for combinations of a tag and a character or character type that cannot actually occur grammatically. The scores of such grammatically impossible features, however, are relatively low.

The following are examples of output feature scores for the Japanese character "本 (hon)". This character cannot be a particle in Japanese grammar, but, as described above, data can also be prepared for a grammatically nonexistent feature such as "S-P-k/本 (hon)".
{S-N-nc/本 (hon), 0.0420}
{B-N-nc/本 (hon), 0.0310}
{S-P-k/本 (hon), 0.0003}
{B-V-c/本 (hon), 0.0031}

The following are examples of output feature scores for the character type "kanji".
{S-N-nc/C, 0.0255}
{E-N-np/C, 0.0488}
{S-P-k/C, 0.0000}
{B-V-c/C, 0.0299}

For character types as well, data indicating grammatically nonexistent features can be prepared. For example, a word written in Arabic numerals cannot be a particle in Japanese grammar, but data can also be prepared for a feature such as "S-P-k/A".

The bigram output feature for characters is a clue indicating the correspondence between a tag t and a character string x_i x_{i+1}. The bigram output feature for character types is a clue indicating the correspondence between a tag t and a sequence of character types c_i c_{i+1}. In this embodiment, the likelihood score s of the tag t and the characters x_i, x_{i+1} is written {t/x_i/x_{i+1}, s}, and the likelihood score s of the tag t and the character types c_i, c_{i+1} is written {t/c_i/c_{i+1}, s}. If data for every possible tag is prepared for one bigram, the division model 20 also stores data for combinations of a tag and a bigram that cannot actually occur grammatically.

The following are examples of output feature scores for the bigram "本を (hon wo)".
{S-N-nc/本 (hon)/を (wo), 0.0420}
{B-N-nc/本 (hon)/を (wo), 0.0000}
{S-P-k/本 (hon)/を (wo), 0.0001}
{B-V-c/本 (hon)/を (wo), 0.0009}

The following are examples of output feature scores for a bigram in which a kanji is followed by a hiragana.
{S-N-nc/C/H, 0.0455}
{E-N-np/C/H, 0.0412}
{S-P-k/C/H, 0.0000}
{B-V-c/C/H, 0.0054}
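The four kinds of output features described so far (unigram and bigram, each for characters and for character types) can be enumerated for one position of a sentence as sketched below. The function name `output_features` and the tuple layout with a leading kind label ("char" or "type") are assumptions for illustration. The label keeps a character feature from colliding with a character-type feature that happens to use the same letter.

```python
def output_features(chars, types, i, tag):
    """List the output features that pair `tag` with position i."""
    feats = [
        ("char", tag, chars[i]),  # {t/x, s}
        ("type", tag, types[i]),  # {t/c, s}
    ]
    if i + 1 < len(chars):  # a bigram needs a following character
        feats.append(("char", tag, chars[i], chars[i + 1]))  # {t/x_i/x_{i+1}, s}
        feats.append(("type", tag, types[i], types[i + 1]))  # {t/c_i/c_{i+1}, s}
    return feats


chars = ["本", "を"]
types = ["C", "H"]
first = output_features(chars, types, 0, "S-N-nc")
last = output_features(chars, types, 1, "S-P-k")
```

In this sketch the last character of a sentence has no bigram features; how the embodiment treats sentence boundaries is not stated in this section.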

A transition feature is a clue indicating the combination of the tag t_i of a character x_i and the tag t_{i+1} of the next character x_{i+1} (a combination of the two tags corresponding to two consecutive characters). The transition feature is a bigram feature. It corresponds to a feature representation of the transition matrix of a hidden Markov model. In this embodiment, the likelihood score s of the combination of the tag t_i and the tag t_{i+1} is written {t_i/t_{i+1}, s}. If transition feature data is prepared for every possible combination, the division model 20 also stores data for combinations of two tags that cannot actually occur grammatically.

The following are some examples of transition feature scores.
{S-N-nc/S-P-k, 0.0512}
{E-N-nc/E-N-nc, 0.0000}
{S-P-k/B-V-c, 0.0425}
{B-V-c/I-V-c, 0.0387}
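The transition features can be pictured as a sparse transition matrix. The following Python sketch, which uses the hypothetical names `transition` and `path_transition_score`, stores the example scores above and sums them along a tag sequence; tag pairs without an entry are assumed to score 0.

```python
# Hypothetical sparse transition table: (t_i, t_{i+1}) -> score.
transition = {
    ("S-N-nc", "S-P-k"): 0.0512,
    ("E-N-nc", "E-N-nc"): 0.0000,
    ("S-P-k", "B-V-c"): 0.0425,
    ("B-V-c", "I-V-c"): 0.0387,
}


def path_transition_score(tags):
    """Sum the transition scores over every pair of consecutive tags."""
    return sum(transition.get(pair, 0.0) for pair in zip(tags, tags[1:]))
```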

The natural language processing system 10 includes one or more computers. When it includes a plurality of computers, each functional element of the natural language processing system 10 described later is realized by distributed processing. The type of each computer is not limited. For example, a stationary or portable personal computer (PC) may be used, a workstation may be used, or a mobile terminal such as a high-function mobile phone (smartphone), a mobile phone, or a personal digital assistant (PDA) may be used. Alternatively, the natural language processing system 10 may be constructed by combining various types of computers. When a plurality of computers are used, they are connected via a communication network such as the Internet or an intranet.

FIG. 3 shows a general hardware configuration of each computer 100 in the natural language processing system 10. The computer 100 includes a CPU (processor) 101 that executes an operating system, application programs, and the like; a main storage unit 102 composed of ROM and RAM; an auxiliary storage unit 103 composed of a hard disk, flash memory, or the like; a communication control unit 104 composed of a network card or a wireless communication module; an input device 105 such as a keyboard and a mouse; and an output device 106 such as a display and a printer. Naturally, the hardware modules installed differ depending on the type of the computer 100. For example, stationary PCs and workstations often have a keyboard, a mouse, and a monitor as input and output devices, whereas in a smartphone a touch panel often functions as both the input device and the output device.

Each functional element of the natural language processing system 10 described later is realized by loading predetermined software onto the CPU 101 or the main storage unit 102, operating the communication control unit 104, the input device 105, the output device 106, and the like under the control of the CPU 101, and reading and writing data in the main storage unit 102 or the auxiliary storage unit 103. The data and databases necessary for the processing are stored in the main storage unit 102 or the auxiliary storage unit 103.

The division model 20, on the other hand, is stored in a storage device in advance. The specific implementation of the division model 20 is not limited; for example, the division model 20 may be prepared as a relational database or as a text file. The location of the division model 20 is also not limited; for example, the division model 20 may reside inside the natural language processing system 10 or in another computer system different from the natural language processing system 10. When the division model 20 resides in another system, the natural language processing system 10 accesses the division model 20 via a communication network.

As described above, the division model 20 can be regarded as a set of scores of various features. Mathematically, a division model 20 containing the scores w_1, w_2, …, w_n of n features can be represented by the vector w = {w_1, w_2, …, w_n}. When the division model 20 is newly created, the score of every feature is 0, that is, w = {0, 0, …, 0}. The scores are updated little by little through the processing of the natural language processing system 10 described later. After a reasonably large number of sentences have been processed, differences arise among the individual feature scores, as in the examples above.
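One way to picture the vector w is as a list of scores together with a table that maps each feature to its position. The class below is a hypothetical illustration only (the name `DivisionModel` and its methods are not part of the embodiment); it reproduces the property that every score is 0 until an update touches it.

```python
class DivisionModel:
    """Sparse view of the score vector w = {w_1, ..., w_n} (illustrative only)."""

    def __init__(self):
        self.index = {}  # feature -> position in w (hypothetical layout)
        self.w = []      # scores; every feature starts at 0.0

    def score(self, feature):
        """Return the current score of a feature, or 0.0 if it was never updated."""
        pos = self.index.get(feature)
        return 0.0 if pos is None else self.w[pos]

    def add(self, feature, delta):
        """Add delta to the score of a feature, allocating a slot on first use."""
        pos = self.index.setdefault(feature, len(self.w))
        if pos == len(self.w):
            self.w.append(0.0)
        self.w[pos] += delta
```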

As shown in FIG. 4, the natural language processing system 10 includes an acquisition unit 11, an analysis unit 12, and a correction unit 13 as functional components. The natural language processing system 10 accesses the division model 20 as necessary. Each functional element is described below on the assumption that the natural language processing system 10 processes Japanese sentences. However, the language of the sentences processed by the natural language processing system 10 is not limited to Japanese, and sentences in other languages such as Chinese can also be analyzed.

The acquisition unit 11 is a functional element that acquires a sentence to be divided into a sequence of morphemes. The method by which the acquisition unit 11 acquires sentences is not limited. For example, the acquisition unit 11 may collect sentences from any website on the Internet (so-called crawling). Alternatively, the acquisition unit 11 may read sentences stored in advance in a database in the natural language processing system 10, or may access, via a communication network, a database on a computer system other than the natural language processing system 10 and read sentences stored there in advance. Alternatively, the acquisition unit 11 may accept a sentence input by a user of the natural language processing system 10. When an instruction to analyze the first sentence is input, the acquisition unit 11 acquires one sentence and outputs it to the analysis unit 12. Thereafter, each time a completion notification is input from the correction unit 13 described later, the acquisition unit 11 acquires the next sentence and outputs it to the analysis unit 12.

The analysis unit 12 is a functional element that performs morphological analysis on each sentence. The analysis unit 12 executes the following processing each time one sentence is input.

First, the analysis unit 12 divides the sentence into individual characters and determines the character type of each character. The analysis unit 12 stores in advance either a table that maps characters to character types or regular expressions for determining character types, and determines the character types using that table or those regular expressions.
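The regular-expression variant of this step might look like the following Python sketch. The labels "C" (kanji) and "H" (hiragana) follow the examples in this description; the label "K" for katakana, the fallback label "O", and the Unicode ranges chosen here are assumptions made for illustration.

```python
import re

# (label, pattern) pairs tried in order; ranges are illustrative assumptions.
CHAR_TYPE_PATTERNS = [
    ("C", re.compile(r"[\u4e00-\u9fff]")),  # kanji (CJK Unified Ideographs)
    ("H", re.compile(r"[\u3041-\u309f]")),  # hiragana
    ("K", re.compile(r"[\u30a0-\u30ff]")),  # katakana (assumed label)
]


def char_types(sentence):
    """Split a sentence into characters and return one type label per character."""
    labels = []
    for ch in sentence:
        for label, pattern in CHAR_TYPE_PATTERNS:
            if pattern.match(ch):
                labels.append(label)
                break
        else:
            labels.append("O")  # any other character (assumed fallback label)
    return labels
```

For the sentence "本を買って", this sketch yields the character types C, H, C, H, H.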

Subsequently, the analysis unit 12 determines the tag of each character using the Viterbi algorithm. For the i-th character, the analysis unit 12 determines, for each tag that may finally be selected (candidate tag), which of the candidate tags of the (i−1)-th character gives the highest score (also referred to as the "connection score") when connected to it. Here, the connection score is the sum of the various scores related to the tag under calculation (the unigram output feature scores, the bigram output feature scores, and the transition feature score). For example, the analysis unit 12 determines that, when the i-th tag is "S-N-nc", the connection score is highest when the (i−1)-th tag is "S-P-k", and that, when the i-th tag is "S-V-c", the connection score is highest when the (i−1)-th tag is "E-N-nc". The analysis unit 12 then stores all of the combinations with the highest connection scores (for example, (S-P-k, S-N-nc) and (E-N-nc, S-V-c)). The analysis unit 12 executes this processing while advancing one character at a time from the first character to the end-of-sentence symbol.

Because only one tag (EOS) exists for the end-of-sentence symbol, the combination of the tag of the last character and the tag of the end-of-sentence symbol that has the highest connection score is uniquely determined (for example, the combination is determined to be (E-V-c, EOS)). The tag of the last character is thereby determined (for example, to be "E-V-c"), and as a result the tag of the second-to-last character is also determined. In this way, the tags are fixed one after another in order from the end of the sentence toward the beginning.

FIG. 5 schematically shows this processing by the analysis unit 12, using an example of tagging a sentence of four characters. To simplify the explanation, the tags in this example are abbreviated as "A1", "B2", and so on, and each character has three candidate tags. The thick lines in FIG. 5 indicate the tag-to-tag combinations determined to have the highest connection score when the sentence is processed from the front. For example, in the processing of the third character, the tag C1 has the highest connection score with the tag B1, the tag C2 has the highest connection score with the tag B1, and the tag C3 has the highest connection score with the tag B2. In the example of FIG. 5, once processing reaches the end of the sentence (EOS), the combination (D1, EOS) is fixed, then the combination (C2, D1) is fixed, and then the combinations (B1, C2) and (A2, B1) are fixed in turn. The analysis unit 12 therefore determines that the tags of the first to fourth characters are A2, B1, C2, and D1, respectively.
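The forward pass and the backward trace described above can be sketched in Python as follows. This is a generic Viterbi decoder written under simplifying assumptions, not the embodiment's implementation: every position shares the same candidate tags, and the connection score is supplied by a hypothetical callback `score(i, prev_tag, tag)` that is expected to sum the output feature and transition feature scores (with `prev_tag` set to `None` for the first character and `tag` set to "EOS" at position n).

```python
def viterbi(n, tags, score):
    """Return the highest-scoring tag sequence for a sentence of n characters."""
    if n == 0:
        return []
    # best[i][t]: best total score of any path ending with tag t at position i.
    best = [{t: score(0, None, t) for t in tags}]
    back = [{}]  # back[i][t]: the previous tag on that best path
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(i, p, t))
            best[i][t] = best[i - 1][prev] + score(i, prev, t)
            back[i][t] = prev
    # Only one tag (EOS) exists at the end, so the last tag is fixed first.
    last = max(tags, key=lambda t: best[n - 1][t] + score(n, t, "EOS"))
    path = [last]
    for i in range(n - 1, 0, -1):  # trace back from the end toward the beginning
        path.append(back[i][path[-1]])
    return path[::-1]
```

Storing one back-pointer per candidate tag corresponds to the thick lines of FIG. 5, and the final loop corresponds to fixing the tags in order from the end of the sentence.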

The analysis unit 12 outputs the sentence with each character tagged as the analysis result. The analysis unit 12 outputs the analysis result at least to the correction unit 13, because the analysis result is needed to correct the division model 20. The analysis unit 12 may also perform further output. For example, the analysis unit 12 may display the analysis result on a monitor, print it on a printer, write it to a text file, or store it in a storage device such as a memory or a database. Alternatively, the analysis unit 12 may transmit the analysis result via a communication network to any computer system other than the natural language processing system 10.

The correction unit 13 is a functional element that corrects the division model 20 based on the difference between the analysis result obtained from the analysis unit 12 and the correct answer of the morphological analysis of the sentence. In this specification, "correction of the division model" means processing that changes the score of at least one feature in the division model. In some cases, an attempt to change a score may leave its value unchanged. The correction unit 13 executes the following processing each time one analysis result is input.

First, the correction unit 13 acquires the correct data corresponding to the input analysis result, that is, data indicating the correct answer of the morphological analysis of the sentence processed by the analysis unit 12. The correct data in the present embodiment is data indicating the tag of each character forming the sentence (a combination of appearance mode, part of speech, and part-of-speech subclass). The correct data is created manually. The method by which the correction unit 13 acquires the correct data is not limited. For example, the correction unit 13 may read correct data stored in advance in a database in the natural language processing system 10, or may access, via a communication network, a database on a computer system other than the natural language processing system 10 and read correct data stored there in advance. Alternatively, the correction unit 13 may accept correct data input by a user of the natural language processing system 10.

Upon acquiring the correct data, the correction unit 13 compares the input analysis result with the correct data and identifies the differences between them.

When the analysis result completely matches the correct data and there is no difference, the correction unit 13 ends the processing without correcting the division model 20, generates a completion notification, and outputs it to the acquisition unit 11. The completion notification is a signal indicating that the processing in the correction unit 13 has finished and that morphological analysis of the next sentence can be executed. A complete match between the analysis result and the correct data means that the division model 20 does not need to be corrected, at least at this point, so the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the current division model 20 as it is.
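The comparison performed by the correction unit 13 amounts to a position-by-position check of two tag sequences. A minimal Python sketch, with the hypothetical function name `tag_differences`, is shown below; an empty result corresponds to the case in which the division model is left unchanged.

```python
def tag_differences(predicted, gold):
    """Return (position, predicted_tag, correct_tag) for every mismatching character."""
    if len(predicted) != len(gold):
        raise ValueError("both sequences must cover the same characters")
    return [(i, p, g)
            for i, (p, g) in enumerate(zip(predicted, gold))
            if p != g]
```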

For example, the correct data for the Japanese sentence "本を買って" (hon wo katte) described above is as follows. For convenience, the characters are also denoted x_1 to x_5.
x_1: {S-N-nc}
x_2: {S-P-k}
x_3: {B-V-c}
x_4: {E-V-c}
x_5: {S-P-sj}

Therefore, when the analysis result shown in FIG. 2 is input, the correction unit 13 determines that the analysis result completely matches the correct data and outputs a completion notification to the acquisition unit 11 without correcting the division model 20.

On the other hand, when the analysis result does not completely match the correct data (that is, when there is a difference between the analysis result and the correct data), the correction unit 13 updates at least some of the scores of the division model 20. More specifically, the correction unit 13 raises, above their current values, the scores of the features related to the correct tag corresponding to an incorrect tag, and lowers, below their current values, the scores of the features related to that incorrect tag.

For example, suppose that the analysis unit 12 obtains the following analysis result from the Japanese sentence "本を買って" (hon wo katte).
x_1: {S-N-nc}
x_2: {S-P-k}
x_3: {B-V-c}
x_4: {I-V-c}
x_5: {E-V-c}

In this case, because the analysis result as a whole is wrong, the correction unit 13 evaluates the features corresponding to each tag in the correct data as "correct (+1)" and raises their scores above the current values, and evaluates the features corresponding to each tag in the analysis result as "wrong (−1)" and lowers their scores below the current values. Taking into account the parts that cancel out as a result, the correction unit 13 can be said to ultimately perform the following processing.

The correction unit 13 raises, above their current values, the scores of the output features "E-V-c/っ (t)" and "S-P-sj/て (te)" corresponding to the correct tags of the characters x_4 and x_5, and lowers, below their current values, the scores of the output features "I-V-c/っ (t)" and "E-V-c/て (te)" related to the incorrect tags. This updates the unigram output feature scores (the scores related to characters) associated with the analyzed sentence.

The correction unit 13 also raises, above their current values, the scores of the output features "E-V-c/H" and "S-P-sj/H" related to the correct tags of the incorrectly tagged characters x_4 and x_5, and lowers, below their current values, the scores of the output features "I-V-c/H" and "E-V-c/H" related to the incorrect tags. This updates the unigram output feature scores (the scores related to character types) associated with the analyzed sentence.

The correction unit 13 also raises, above its current value, the score of the output feature "E-V-c/っ (t)/て (te)" related to the correct tags of the incorrectly tagged characters x_4 and x_5, and lowers, below its current value, the score of the output feature "I-V-c/っ (t)/て (te)" related to the incorrect tags. This updates the bigram output feature scores (the scores related to characters) associated with the analyzed sentence.

The correction unit 13 also raises, above its current value, the score of the output feature "E-V-c/H/H" related to the correct tags of the incorrectly tagged characters x_4 and x_5, and lowers, below its current value, the score of the output feature "I-V-c/H/H" related to the incorrect tags. This updates the bigram output feature scores (the scores related to character types) associated with the analyzed sentence.

The correction unit 13 also raises, above their current values, the scores of the transition features "B-V-c/E-V-c" and "E-V-c/S-P-sj" related to the correct tags of the incorrectly tagged characters x_4 and x_5, and lowers, below their current values, the scores of the transition features "B-V-c/I-V-c" and "I-V-c/E-V-c" related to the incorrect tags. This updates the transition feature scores associated with the analyzed sentence.

As described above, the correction unit 13 may evaluate each tag in the correct data as "correct (+1)" and each tag of each character in the analysis result as "wrong (−1)", cancel the two evaluation results for each tag against each other, and then raise the scores of the features corresponding to the tags evaluated as "correct (+1)" and lower the scores of the features corresponding to the tags evaluated as "wrong (−1)".
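The +1/−1 evaluation with cancellation can be illustrated by counting the features that fire for the correct tags and for the predicted tags and then subtracting the two counts. The Python sketch below is an illustration under assumptions: the feature keys ("out1", "out1t", "out2", "out2t", "trans") and the function names are hypothetical, and the feature templates are limited to the five kinds discussed in this description.

```python
from collections import Counter


def features(chars, types, tags):
    """Count the output and transition features that fire for one tagged sentence."""
    feats = Counter()
    for i, t in enumerate(tags):
        feats[("out1", t, chars[i])] += 1        # unigram, character
        feats[("out1t", t, types[i])] += 1       # unigram, character type
        if i + 1 < len(tags):
            feats[("out2", t, chars[i], chars[i + 1])] += 1    # bigram, characters
            feats[("out2t", t, types[i], types[i + 1])] += 1   # bigram, character types
            feats[("trans", t, tags[i + 1])] += 1              # transition
    return feats


def feature_deltas(chars, types, gold, predicted):
    """Return {feature: net evaluation}, dropping features that cancel out."""
    delta = Counter(features(chars, types, gold))       # each correct feature: +1
    delta.subtract(features(chars, types, predicted))   # each predicted feature: -1
    return {f: d for f, d in delta.items() if d != 0}
```

Applied to "本を買って" with the correct data and the analysis result shown above, the features of x_1 to x_3 cancel out, and only features involving x_4 and x_5 remain. In this sketch the character-type feature E-V-c/H also nets to zero, because it fires once for the correct tag of x_4 and once for the incorrect tag of x_5.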

When updating feature scores, the correction unit 13 may use SCW (Soft Confidence-Weighted learning). SCW is a technique that regards a parameter with a large variance as one about which there is not yet confidence (not yet accurate) and updates it by a large amount, and regards a parameter with a small variance as reasonably accurate and updates it by a small amount. The correction unit 13 determines the amount of change of a score based on the variance of that score, which has a range of values. To execute SCW, a Gaussian distribution is introduced into the division model 20 (the vector w), and the correction unit 13 updates the mean and the covariance matrix of each score at the same time as it updates the score. The initial value of the mean of each score is 0. In the initial covariance matrix, the diagonal elements are 1 and the other (off-diagonal) elements are 0. FIG. 6(a) shows a mode in which a score with a large variance is changed greatly (that is, the amount of change in the score is large), and FIG. 6(b) shows a mode in which a score with a small variance is changed only slightly (that is, the amount of change in the score is small). FIGS. 6(a) and 6(b) each show that the covariance matrix Σ is also updated when the score is updated from Sa to Sb. Regarding the update of the covariance matrix, the accuracy of the score calculation can be maintained without considering the correlation between one feature and another, so in the present embodiment only the diagonal elements of the covariance matrix are calculated and the off-diagonal elements are not. This increases the speed at which the scores are updated.
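To make the idea of variance-dependent updates with a diagonal covariance concrete, the sketch below implements an AROW-style update rather than SCW itself. AROW is a related confidence-weighted method and is used here only as a simpler stand-in; the function name, the dictionary representation, and the regularization parameter `r` are assumptions for illustration. As in the text, each mean starts at 0 and each variance starts at 1.

```python
def arow_diag_update(mu, sigma, x, y, r=1.0):
    """AROW-style update with a diagonal covariance (a stand-in for SCW).

    mu:    dict feature -> mean score (missing entries read as 0.0)
    sigma: dict feature -> variance (missing entries read as 1.0)
    x:     dict feature -> feature value
    y:     +1 for a correct example, -1 for a wrong one
    """
    margin = y * sum(mu.get(f, 0.0) * v for f, v in x.items())
    if margin >= 1.0:
        return  # already classified with enough margin; nothing to change
    confidence = sum(sigma.get(f, 1.0) * v * v for f, v in x.items())
    beta = 1.0 / (confidence + r)
    alpha = (1.0 - margin) * beta
    for f, v in x.items():
        s = sigma.get(f, 1.0)
        mu[f] = mu.get(f, 0.0) + alpha * y * s * v  # larger variance, larger step
        sigma[f] = s - beta * s * s * v * v         # variance shrinks after each update
```

Because each feature's step is multiplied by its own variance, a feature that has rarely been updated moves further than one that has already been updated many times, and only one variance per feature needs to be stored.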

The correction unit 13 may update the feature scores using a technique other than SCW. Examples of such techniques include Perceptron, Passive Aggressive (PA), Confidence Weighted (CW), and Adaptive Regularization of Weight Vectors (AROW).

After correcting the division model 20 by updating the scores of the features related to the analyzed sentence, the correction unit 13 generates a completion notification and outputs it to the acquisition unit 11. In this case, the natural language processing system 10 (more specifically, the analysis unit 12) analyzes the next sentence using the corrected division model 20.

Next, the operation of the natural language processing system 10 and the natural language processing method according to the present embodiment will be described with reference to FIG. 7.

First, the acquisition unit 11 acquires one sentence (step S11). Subsequently, the analysis unit 12 performs morphological analysis on the sentence using the division model 20 (step S12, analysis step). Through this morphological analysis, a tag such as "S-N-nc" is assigned to each character of the sentence.

Subsequently, the correction unit 13 obtains the difference between the result of the morphological analysis by the analysis unit 12 and the correct data for that morphological analysis (step S13). When there is no difference (step S14; NO), that is, when the morphological analysis by the analysis unit 12 is completely correct, the correction unit 13 ends the processing without correcting the division model 20. On the other hand, when there is a difference between the analysis result and the correct data (step S14; YES), that is, when at least part of the morphological analysis by the analysis unit 12 is incorrect, the correction unit 13 corrects the division model 20 by updating the scores of the features related to the analyzed sentence (step S15, correction step). Specifically, the correction unit 13 raises, above their current values, the scores of the features related to the correct tag corresponding to an incorrect tag, and lowers, below their current values, the scores of the features related to that incorrect tag.

When the processing in the correction unit 13 is completed, the process returns to step S11 (see step S16). The acquisition unit 11 acquires the next sentence (step S11), and the analysis unit 12 performs morphological analysis on that sentence (step S12). At this time, if the division model 20 was corrected (step S15) during the processing of the previous sentence, the analysis unit 12 performs the morphological analysis using the corrected division model 20. The correction unit 13 then executes the processing from step S13 onward. This repetition continues as long as there are sentences to be processed (see step S16).

An example of an algorithm showing the operation of the natural language processing system 10 is given below.
Initialize w_1
For t = 1, 2, …
  Receive instance x_t
  Predict structure ŷ_t based on w_t
  Receive correct structure y_t
  If ŷ_t ≠ y_t, update
    w_{t+1} = update(w_t, y_t, +1)
    w_{t+1} = update(w_t, ŷ_t, −1)

The first line of the above algorithm initializes the division model 20 (the variable w_1); through this processing, for example, the score of each feature is set to 0. The For loop on the second line indicates that the processing from the third line onward is executed one sentence at a time. The third line acquires the sentence x_t and corresponds to step S11 above. The fourth line assigns a tag to each character by performing morphological analysis based on the division model 20 at that point (w_t) and corresponds to step S12 above; ŷ_t denotes the analysis result. The fifth line acquires the correct data y_t for the morphological analysis of the sentence x_t. The sixth line indicates that the division model 20 is updated (corrected) when there is a difference between the analysis result ŷ_t and the correct data y_t. The seventh line learns the correct data y_t as a positive example, and the eighth line learns the erroneous analysis result ŷ_t as a negative example. The processing on the seventh and eighth lines corresponds to step S15 above.
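The loop can be rendered in Python roughly as follows. The function `train_online` mirrors the eight lines of the algorithm; applying the two updates one after the other to the same model is an interpretation of the seventh and eighth lines. The `predict` and `update` functions below are deliberately tiny, hypothetical stand-ins (a per-character tag lookup with simple +1/−1 updates) included only so that the sketch runs; they are not the Viterbi analysis or the SCW update of the embodiment.

```python
from collections import defaultdict

TAGS = ["X", "Y"]  # toy tag set, for illustration only


def predict(w, x):
    """Toy analysis: choose, for each character, the tag with the highest score."""
    return [max(TAGS, key=lambda t: w[(c, t)]) for c in x]


def update(w, x, y, sign):
    """Toy correction: move the score of each (character, tag) pair by sign."""
    for c, t in zip(x, y):
        w[(c, t)] += sign
    return w


def train_online(instances, w):
    """Process (sentence, correct tags) pairs one at a time, correcting w on errors."""
    for x, y_gold in instances:          # lines 2, 3, and 5 of the algorithm
        y_pred = predict(w, x)           # line 4
        if y_pred != y_gold:             # line 6
            w = update(w, x, y_gold, +1)   # line 7: correct data as positive example
            w = update(w, x, y_pred, -1)   # line 8: wrong result as negative example
    return w


w = train_online([("ab", ["X", "Y"])] * 3, defaultdict(float))  # line 1: all scores 0
```

In this toy run the first sentence is tagged incorrectly, the model is corrected once, and the remaining sentences are then tagged correctly with the corrected model.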

Next, a natural language processing program P1 for realizing the natural language processing system 10 will be described with reference to FIG. 8.

The natural language processing program P1 includes a main module P10, an acquisition module P11, an analysis module P12, and a correction module P13.

The main module P10 is the part that performs overall control of the morphological analysis and its related processing. The functions realized by executing the acquisition module P11, the analysis module P12, and the correction module P13 are the same as the functions of the acquisition unit 11, the analysis unit 12, and the correction unit 13 described above, respectively.

The natural language processing program P1 may be provided after being fixedly recorded on a tangible recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the natural language processing program P1 may be provided via a communication network as a data signal superimposed on a carrier wave.

As described above, a natural language processing system according to one aspect of the present invention includes an analysis unit and a correction unit. The analysis unit performs morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting, for each divided element obtained by dividing the sentence, a tag indicating at least the part of speech of a word. The division model includes output feature scores, each indicating a correspondence between a divided element and a tag, and transition feature scores, each indicating a combination of two tags corresponding to two consecutive divided elements. The correction unit compares the tags indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tags of the sentence. It raises above their current values the output feature scores and transition feature scores related to the correct tag corresponding to each incorrect tag, and lowers below their current values the output feature scores and transition feature scores related to the incorrect tag, thereby correcting the division model that the analysis unit uses in the morphological analysis of the next sentence.

Moreover, raising the feature scores related to the correct tags and lowering the feature scores related to the incorrect tags further increases the accuracy of the morphological analysis of the next sentence.

In a natural language processing system according to another aspect, the divided element may be a character. By processing each character using character-level knowledge (output features and transition features), morphological analysis can be executed without a word dictionary, which generally becomes large. In addition, because the division model is corrected sentence by sentence using character-level knowledge rather than word knowledge, the next sentence can be morphologically analyzed with high accuracy even if it differs in field or nature from every sentence analyzed so far. That is, the natural language processing system according to one aspect of the present invention is adaptable to sentences in an unknown field or sentences having unknown properties.

In a natural language processing system according to another aspect, each of the output feature scores and the transition feature scores may have a range of values, a variance may be set for each score, and the correction unit may determine, based on the variance of each score, the amount by which that score changes when it is raised or lowered. This technique allows the score of each feature to converge quickly.
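As an illustration of a variance-dependent change amount, the following Python sketch keeps a mean and a variance per feature and moves high-variance (less certain) scores further than low-variance ones. The text above does not give the update rule, so the sketch assumes an AROW-style update with a diagonal covariance as one possible instance; the function name `variance_scaled_update`, the regularization constant `r`, and the initial variance `init_var` are hypothetical.

```python
def variance_scaled_update(mean, var, delta, r=1.0, init_var=1.0):
    """Update per-feature means and variances in place.

    delta maps each feature to (count in correct data) - (count in wrong result),
    so features of the correct tags are positive and those of wrong tags negative.
    The AROW-style rule used here is an assumption, not taken from the source.
    """
    margin = sum(mean.get(f, 0.0) * x for f, x in delta.items())
    loss = max(0.0, 1.0 - margin)
    if loss == 0.0:
        return  # margin already large enough; leave the model unchanged

    # Total uncertainty of the features involved in this update.
    confidence = sum(var.get(f, init_var) * x * x for f, x in delta.items())
    beta = 1.0 / (confidence + r)
    alpha = loss * beta

    for f, x in delta.items():
        v = var.get(f, init_var)
        mean[f] = mean.get(f, 0.0) + alpha * v * x   # change scaled by the variance
        var[f] = v - beta * v * v * x * x            # variance shrinks after each update
```

Because the variance of a feature decreases each time it is updated, frequently seen features settle while rarely seen ones remain free to move, which is one way the quick convergence mentioned above can arise.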

The present invention has been described in detail above based on its embodiments. However, the present invention is not limited to the above embodiments. The present invention can be modified in various ways without departing from its gist.

In general, the number of features included in the division model 20 increases with the number of characters handled, so for languages with many characters, such as Japanese and Chinese, the division model 20 becomes very large, and so does the storage capacity it requires. Therefore, a technique called feature hashing may be introduced to convert individual features into numerical values using a hash function. Converting the characters and character strings that form part of a feature into numerical values is particularly effective. On the other hand, hashing the transition features contributes little to compressing the division model 20 and may even slow down processing. Therefore, only the output features may be hashed, without hashing the transition features. As for the hash function, a single function may be used, or different hash functions may be used for the output features and the transition features.

In this case, the division model 20 stores data on features in which individual characters are represented by numerical values. For example, the character "本" (hon) is converted to the numerical value 34, and the character "を" (wo) is converted to the numerical value 4788. This conversion makes it possible to form a bounded set of features. With feature hashing, the same numerical value may be assigned to multiple characters or character strings, but the probability that two frequently occurring characters or character strings receive the same value is very low, so such collisions can be ignored.

That is, in a natural language processing system according to another aspect, the division model may include output features that have been converted into numerical values by a hash function. Handling characters as numerical values saves the memory capacity required to store the division model.
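The hashing of output features described above can be sketched in Python as follows. The choice of CRC-32 from the standard `zlib` module, the table size `NUM_BUCKETS`, and the function names are assumptions made for illustration; the description names no particular hash function, and the values 34 and 4788 in the example above are not reproduced by this sketch.

```python
import zlib

NUM_BUCKETS = 2 ** 18  # assumed upper bound on the number of distinct hashed values


def hash_index(text, num_buckets=NUM_BUCKETS):
    """Map a character or character string to an integer in [0, num_buckets)."""
    return zlib.crc32(text.encode("utf-8")) % num_buckets


def output_feature_key(char, tag):
    """Output feature: the character part is hashed, the tag is kept as is."""
    return (hash_index(char), tag)


def transition_feature_key(prev_tag, tag):
    """Transition feature: left unhashed, since hashing it saves little space."""
    return (prev_tag, tag)
```

Taking the hash value modulo a fixed table size is what bounds the feature set: however many distinct characters appear, at most `NUM_BUCKETS` distinct character values are stored.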

The analysis unit 12 may execute the morphological analysis using features with relatively high scores, without using (that is, ignoring) features with relatively low scores. Techniques for ignoring features with relatively low scores include, for example, Forward-Backward Splitting (FOBOS) and feature quantization.

FOBOS is a technique that shrinks scores toward 0 through regularization (for example, L1 regularization). Using FOBOS makes it possible to ignore features whose score is equal to or less than a predetermined value (for example, features whose score is 0 or close to 0).
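The shrinking step can be illustrated with the L1 soft-thresholding operation that FOBOS applies when L1 regularization is used. The following Python sketch shows only that step, under the assumption that scores are held in a dictionary; the function name `l1_shrink` and the parameter `strength` (the product of learning rate and regularization weight) are hypothetical.

```python
def l1_shrink(scores, strength):
    """Move every score toward 0 by `strength`; drop scores that reach 0.

    Implements w <- sign(w) * max(|w| - strength, 0) for each feature.
    """
    shrunk = {}
    for feature, score in scores.items():
        magnitude = abs(score) - strength
        if magnitude > 0.0:
            shrunk[feature] = magnitude if score > 0 else -magnitude
        # Otherwise the score has been shrunk to 0 and the feature is omitted,
        # so the analysis never looks it up.
    return shrunk
```

For example, with `strength=0.05`, a score of -0.02 is removed while a score of 0.5 becomes 0.45.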

Feature quantization is a technique that converts a feature score into an integer by multiplying its fractional value by 10^n (where n is a natural number of 1 or more). For example, multiplying the score "0.123456789" by 1000 and converting the result to an integer gives a score of "123". Quantizing the scores saves the memory capacity required to store them as text. This technique also makes it possible to ignore features whose score is equal to or less than a predetermined value (for example, features whose score after conversion to an integer is 0 or close to 0). For example, suppose that the scores of features Fa and Fb are 0.0512 and 0.0003, respectively. If these scores are multiplied by 1000 and converted to integers, the scores of Fa and Fb become 51 and 0, respectively. In this case, the analysis unit 12 executes the morphological analysis without using the feature Fb.
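The quantization step can be sketched in Python as follows. The sketch assumes that the conversion to an integer truncates toward zero, which is consistent with the examples above although the text does not state whether truncation or rounding is intended; the function name `quantize` and the dictionary representation are hypothetical.

```python
def quantize(scores, n=3):
    """Multiply each score by 10**n, truncate to an integer, drop zero results."""
    factor = 10 ** n
    quantized = {}
    for feature, score in scores.items():
        q = int(score * factor)  # int() truncates toward zero (an assumption)
        if q != 0:
            quantized[feature] = q
    return quantized


model = quantize({"Fa": 0.0512, "Fb": 0.0003})
# Fa is kept with the integer score 51; Fb quantizes to 0 and is dropped.
```

Dropping the zero entries is what lets the analysis ignore Fb: a feature that is absent from the stored model contributes nothing to any tag sequence.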

The regularization or quantization processing is executed by, for example, the correction unit 13, another functional element in the natural language processing system 10, or a computer system separate from the natural language processing system 10. When the correction unit 13 executes the regularization or quantization processing, it does so once, after a set of sentences (for example, a fairly large number of sentences) has been morphologically analyzed in the natural language processing system 10 and the division model 20 has been corrected many times.

That is, in a natural language processing system according to another aspect, the analysis unit may execute the morphological analysis without using features whose score has become equal to or less than a predetermined value through regularization or quantization. Not using features with relatively low scores (for example, features whose score becomes 0 or close to 0 through regularization or quantization) reduces the data size of the division model and shortens the time required for morphological analysis.

In the above embodiment, the analysis unit 12 divides a sentence into individual characters and sets a tag for each character, but the divided element may be a word instead of a character. Accordingly, the analysis unit may execute the morphological analysis using a word dictionary and a division model that indicates feature scores for words rather than characters.

As described above, the natural language processing system according to the present invention can be applied to morphological analysis of any language.

DESCRIPTION OF SYMBOLS 10 ... natural language processing system, 11 ... acquisition unit, 12 ... analysis unit, 13 ... correction unit, 20 ... division model, P1 ... natural language processing program, P10 ... main module, P11 ... acquisition module, P12 ... analysis module, P13 ... correction module.

Claims

A natural language processing system comprising:
an analysis unit that performs morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting, for each divided element obtained by dividing the one sentence, a tag indicating at least a part of speech of a word, wherein the division model includes an output feature score indicating a correspondence between a divided element and a tag, and a transition feature score indicating a combination of two tags corresponding to two consecutive divided elements; and
a correction unit that compares the tag indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tag of the one sentence, makes the output feature score and the transition feature score related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature score and the transition feature score related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of a next sentence by the analysis unit.

The natural language processing system according to claim 1, wherein the divided element is a character.

The natural language processing system according to claim 1 or 2, wherein the division model includes the output feature converted into a numerical value by a hash function.

The natural language processing system according to any one of claims 1 to 3, wherein each of the output feature score and the transition feature score has a range of values, a variance is set for each score, and the correction unit determines, based on the variance of each score, the amount by which the score changes when the score is made higher or lower.

The natural language processing system according to any one of claims 1 to 4, wherein the analysis unit performs the morphological analysis without using a feature whose score has become equal to or less than a predetermined value through regularization or quantization.

A natural language processing method executed by a natural language processing system including a processor, the method comprising:
an analysis step of performing morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting, for each divided element obtained by dividing the one sentence, a tag indicating at least a part of speech of a word, wherein the division model includes an output feature score indicating a correspondence between a divided element and a tag, and a transition feature score indicating a combination of two tags corresponding to two consecutive divided elements; and
a correction step of comparing the tag indicated by the analysis result obtained in the analysis step with correct data indicating the correct tag of the one sentence, making the output feature score and the transition feature score related to the correct tag corresponding to an incorrect tag higher than their current values, and making the output feature score and the transition feature score related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of a next sentence in the analysis step.

A natural language processing program for causing a computer to function as:
an analysis unit that performs morphological analysis on one sentence using a division model obtained by machine learning using one or more pieces of training data, thereby setting, for each divided element obtained by dividing the one sentence, a tag indicating at least a part of speech of a word, wherein the division model includes an output feature score indicating a correspondence between a divided element and a tag, and a transition feature score indicating a combination of two tags corresponding to two consecutive divided elements; and
a correction unit that compares the tag indicated by the analysis result obtained by the analysis unit with correct data indicating the correct tag of the one sentence, makes the output feature score and the transition feature score related to the correct tag corresponding to an incorrect tag higher than their current values, and makes the output feature score and the transition feature score related to the incorrect tag lower than their current values, thereby correcting the division model used in the morphological analysis of a next sentence by the analysis unit.