JP2008217592A

JP2008217592A - Language analysis model learning device, language analysis model learning method, language analysis model learning program and recording medium

Info

Publication number: JP2008217592A
Application number: JP2007056109A
Authority: JP
Inventors: Jun Suzuki; 潤鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-03-06
Filing date: 2007-03-06
Publication date: 2008-09-18
Anticipated expiration: 2027-03-06
Also published as: JP4328362B2

Abstract

PROBLEM TO BE SOLVED: To provide a means of determining the segment-unit error estimation function exactly when estimating a parameter vector, with the same amount of calculation as the approximate calculation used in the conventional method.
SOLUTION: The language analysis model learning device comprises a storage part which stores learning data 144, in which an output candidate graph of output label candidates showing types of classification tags is associated with the data, and an initial value of a parameter vector 145; and a parameter learning part 1413. Using a decision function that expresses the appearance probability of a predetermined evaluation unit of the learning data 144 as a marginal probability with the parameter vector 145 as a variable, the parameter learning part 1413 calculates, as an error estimation function, the difference between the appearance probability of the correct element of the learning data 144 and the appearance probability of the element most likely to be output other than the correct element, and calculates the parameter vector 145 that optimizes an objective function set using the error estimation function.
COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to language analysis processing, and more particularly to a technique for creating, from learning data (a corpus), the feature parameter vector used to assign grammatical or semantic labels to a target text.

There are techniques that, mainly for a target text (hereinafter, raw text), assign parts of speech (called tagging or labeling in technical terms) or extract word or phrase boundaries or meaningful runs of words (called chunking or segmentation in technical terms).
On a computer, raw text is processed as a single stream (a one-dimensional symbol string) from the beginning to the end of the text. In the terminology of natural language processing and machine learning, text labeling and segmentation are therefore classified as the problem types of sequence labeling and sequence segmentation.

Examples of sequence segmentation and sequence labeling are shown in FIG. 1.
As shown in FIG. 1, sequence labeling is defined as the problem of assigning labels to an input sequence (a sequence means a string of symbols; for raw text it corresponds to a character string, a word string, or the like). In natural language processing, this includes the problem of assigning part-of-speech tags to a word string.
As also shown in FIG. 1, sequence segmentation is defined as the problem of finding a division of the input sequence into subsequence units (segments). A segment is a (partial) symbol string consisting of one or more consecutive symbols. In natural language processing, this includes the problem of assigning phrase boundaries to a word string.

Sequence segmentation can easily be reduced to sequence labeling, and it is simpler to solve it as a sequence labeling problem than to handle a model that solves the segmentation problem directly. In general, therefore, the problem is solved with a sequence labeling model and the result is then converted into segments.

The "IOB labeling method" has been proposed as a way of reducing sequence segmentation to sequence labeling. In the IOB labeling method, each segment is split and converted into individual labels using the labels B, I, and O, which represent the beginning of a segment (B), its continuation (I), and a position outside any segment (O).
FIG. 2 illustrates the IOB labeling method. As shown in FIG. 2, by attending to the label B, which marks the start of a segment, labels can be converted to segments and segments to labels easily and uniquely.
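The conversion between segments and IOB labels described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names and the representation of a segment as an inclusive `(start, end)` index pair are assumptions made here.

```python
def segments_to_iob(n, segments):
    """Convert non-overlapping segments, given as inclusive (start, end)
    index pairs (an assumed representation), into n IOB labels."""
    labels = ["O"] * n              # positions outside every segment
    for start, end in segments:
        labels[start] = "B"         # beginning of the segment
        for i in range(start + 1, end + 1):
            labels[i] = "I"         # continuation of the segment
    return labels


def iob_to_segments(labels):
    """Recover (start, end) segments from a well-formed IOB label list."""
    segments = []
    start = None                    # start index of the currently open segment
    for i, label in enumerate(labels):
        if label in ("B", "O") and start is not None:
            segments.append((start, i - 1))   # close the open segment
            start = None
        if label == "B":
            start = i               # a B label always opens a new segment
    if start is not None:           # segment running to the end of the input
        segments.append((start, len(labels) - 1))
    return segments
```

Because every segment starts with exactly one B, two adjacent segments remain distinguishable (`B I B I`), which is what makes the mapping unique in both directions.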

Because the IOB labeling method allows the sequence segmentation problem to be regarded as the same problem as sequence labeling, it can be solved with the same model as a sequence labeling problem.

In the sequence labeling problem, the input sequence (observed data) is denoted x = {x_1, ..., x_n} and the output sequence y = {y_1, ..., y_n}. In part-of-speech tagging, for example, x corresponds to the raw text and x_i is the i-th word, while y is the part-of-speech tag string corresponding to x and y_i is the part of speech assigned to the i-th word. FIG. 3 shows an example of part-of-speech tagging. As shown in FIG. 3, consider assigning a part of speech to each word x_i of the raw text x, "Taro Yamada is the prime minister of Japan."

In the sequence labeling problem, any single output label y_i is determined depending on its neighboring labels (for example, y_{i-1} and y_{i+1}). That is, each label y_i is a variable that depends not only on the input sequence x but also on the output sequence y itself. Consequently, even where the input is locally identical, a completely different label may be assigned depending on the labels estimated around it. Problems in which the output sequence y itself has such interdependence (an interdependent structure) are called "structured learning problems" in machine learning. Models for them can be roughly divided into classical models that obtain a solution by combining locally optimal solutions and models that directly learn and estimate a globally optimal solution for the whole output. Representative examples are the cascade model and the conditional random field, respectively.

The cascade model is a classical method for solving structured learning problems. Finding the globally optimal solution of a problem with an interdependent structure has conventionally required a very large amount of computation and was not a practical approach. The cascade model was therefore proposed as a method of obtaining an overall solution by combining locally optimal solutions.
FIG. 4 is an explanatory diagram showing an example of sequence labeling with the cascade model. As the locally optimal solution, each label y_i is learned and estimated independently, as shown in FIG. 4. Learning and estimating an individual label can be regarded as a simple classification problem, so commonly used machine learning methods for classification can be applied. In other words, the cascade model splits the problem by label and solves each part as a (multi-class) classification problem with a general learning method.

The conditional random field, on the other hand, is one of the models devised to obtain the globally optimal solution that was previously considered difficult. In machine learning, as the cascade model illustrates, learning has conventionally assumed that the individual instances in the data are independent. In real data, however, there are often complex dependencies among instances.
For example, in category classification of Web pages, individual instances (pages) can be handled and classified independently, whereas in part-of-speech classification of the words in a sentence, the label to be assigned varies with the surrounding context.

One method for learning structures with such dependencies is the Markov random field. The Markov random field, however, was originally a model for obtaining the joint probability p(y, x) of input and output. When the input x is known, as with text, modeling the conditional probability p(y | x) rather than the joint probability models the text analysis directly, so higher analysis performance can be expected. Non-Patent Document 1 and Non-Patent Document 2 therefore propose conditional random fields (Conditional Random Fields), which extend the Markov random field framework so that the problem can be solved with a conditional probability instead of a joint probability. These are models for learning and estimating an output in dependence on its surrounding context.

Numerous experiments have shown that optimizing with global information, as a conditional random field does, performs better than the cascade model, which solves the problem by assembling local information.
FIG. 5 is an explanatory diagram showing an example of sequence labeling with a conditional random field. As shown in FIG. 5, when the sequence labeling problem is modeled with a conditional random field (the same holds for a Markov random field), it reduces to finding, among the paths contained in a lattice-shaped output candidate graph representing the possible output label sequences, the optimal path (the path with the highest probability) from the start node to the end node.
"Learning" a model with a conditional random field (also called parameter estimation) means determining the weights of the parameter vector corresponding to the features obtained by feature extraction.

Feature extraction, as in machine learning problems in general, is the process of extracting information that characterizes the relationship between the input x and the output y. The feature extraction function is usually designed by hand, depending on the target problem.
FIG. 6 is an explanatory diagram showing an example of feature extraction. In the example of FIG. 6, features are extracted as a feature vector from combinations of the label to be estimated (i = 4) with the two words before and the two words after it. In sequence labeling it is common to extract features from combinations of the input x and the output y in this way. In the example of FIG. 6, the features (functions) used to estimate an arbitrary label y_i are the combinations of x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2} with y_i.
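The window-based feature extraction of FIG. 6 can be illustrated as follows. This is only a sketch: the function name, the string encoding of each feature, and the `<PAD>` token used beyond the sentence boundary are assumptions introduced here, not details given in the source.

```python
def window_features(words, i, label, width=2):
    """Return features pairing the candidate label y_i with each word in
    the window x_{i-width} .. x_{i+width} (0-based position i)."""
    features = []
    for offset in range(-width, width + 1):
        j = i + offset
        # Positions outside the sentence get a padding token (assumption).
        word = words[j] if 0 <= j < len(words) else "<PAD>"
        features.append(f"x[{offset:+d}]={word}&y={label}")
    return features
```

Each returned string names one dimension of the feature vector; the parameter vector λ holds one weight per such dimension.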

Learning, which determines the weights of the parameter vector, is the problem of estimating the value of the parameter (vector) λ corresponding to the individual features extracted by the feature extraction function. "Corresponding" here means that individual features and elements of the parameter vector are in one-to-one correspondence: the feature vector and the parameter vector have the same number of dimensions, and the weight of the i-th feature is given by the i-th element of the parameter vector.

A conditional random field is defined as follows. Let x be the random variable for the input (observed data) and y = {y_1, ..., y_n} the random variable for the output. When y satisfies the Markov property, that is, when p(y_i | x, y_1, ..., y_{i-1}, y_{i+1}, ..., y_n) = p(y_i | x, N(y_i)), the pair (x, y) is a conditional random field. Here N(y_i) denotes the set of random variables connected to y_i when the dependencies among the individual random variables are drawn as a graph. In other words, y_i is a variable determined by the connected random variables N(y_i) and by x. In general, letting c be the set of cliques in the graph representing the connections among the y_i, a conditional random field is a log-linear model of the potential functions of the individual cliques.

Since this description is specific to sequence labeling, only the cliques that arise in sequence labeling are discussed. Sequence labeling uses a lattice-shaped output candidate graph such as that of FIG. 5, so each clique consists of a label pair y_{i-1}, y_i.
FIG. 7 illustrates cliques in a conditional random field. A clique is a set of nodes whose induced subgraph is a complete graph. As shown in FIG. 7, in the lattice-shaped output candidate graph only pairs of adjacent nodes form cliques, so every pair of adjacent nodes joined by a link is a clique.
A complete graph is a graph in which every node has a link to every other node.

As FIG. 7 shows, in sequence labeling the number of cliques on one path is n + 1, where n is the sequence length (because the BOS (start) node and the EOS (end) node are included).
In the conditional random field, let f(y, x, i) denote the local features at an arbitrary clique c_i, and let λ be the parameter vector corresponding to the features. FIG. 8 shows an example of the features of each clique in a conditional random field. The conditional probability of an arbitrary output y on the conditional random field can then be expressed as follows.
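The display formulas referred to here, Equations (1) and (2), are not reproduced in this text. A reconstruction consistent with the surrounding description, assuming the standard linear-chain form of Non-Patent Document 1 and an index range of i = 1, ..., n + 1 over the cliques of a path (both assumptions), would be:

```latex
p(\mathbf{y} \mid \mathbf{x}; \lambda)
  = \frac{1}{Z_{\lambda}(\mathbf{x})}
    \prod_{i=1}^{n+1} \exp\bigl(\lambda \cdot \mathbf{f}(\mathbf{y}, \mathbf{x}, i)\bigr)
  \tag{1}

Z_{\lambda}(\mathbf{x})
  = \sum_{\mathbf{y} \in \mathcal{Y}}
    \prod_{i=1}^{n+1} \exp\bigl(\lambda \cdot \mathbf{f}(\mathbf{y}, \mathbf{x}, i)\bigr)
  \tag{2}
```

Here the symbol for the set of all candidate output sequences for x is written as a calligraphic Y, which is a notational choice made for this reconstruction.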

In Equation (2), Z_λ(x) is the normalization term, the sum over all possible outputs. That is, the conditional probability distribution p(y | x) of the conditional random field is the product, over the positions i of the output sequence y, of the exponential of the weighted local features f(y, x, i), divided by the sum of the same quantity over all possible output sequences.
Equation (1) can then be rewritten as follows.
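The rewritten formula, Equation (3), is likewise missing from this text. Moving the product inside the exponential gives the following form, shown as a reconstruction under the same assumptions about notation and index range (i = 1, ..., n + 1) rather than as the original typeset equation:

```latex
p(\mathbf{y} \mid \mathbf{x}; \lambda)
  = \frac{1}{Z_{\lambda}(\mathbf{x})}
    \exp\Bigl(\sum_{i=1}^{n+1} \lambda \cdot \mathbf{f}(\mathbf{y}, \mathbf{x}, i)\Bigr)
  \tag{3}
```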

From Equation (3), the global features of one whole path can be written, using F, as follows.
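The definition of the global feature is not reproduced here. Based on the statement that the global features are the sum of the local features at each position, a plausible reconstruction (the equation number (4) and the index range are assumptions) is:

```latex
\mathbf{F}(\mathbf{y}, \mathbf{x})
  = \sum_{i=1}^{n+1} \mathbf{f}(\mathbf{y}, \mathbf{x}, i)
  \tag{4}
```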

That is, the global features of a sequence are the sum of the local features at each position. With this global feature F, the conditional probability of Equation (1) can be rewritten as follows.
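The rewritten conditional probability, which later text refers to as Equation (5), is missing from this text. Substituting the global feature F into the exponent yields the following reconstruction (the calligraphic Y for the candidate set is a notational assumption):

```latex
p(\mathbf{y} \mid \mathbf{x}; \lambda)
  = \frac{1}{Z_{\lambda}(\mathbf{x})}
    \exp\bigl(\lambda \cdot \mathbf{F}(\mathbf{y}, \mathbf{x})\bigr),
\qquad
Z_{\lambda}(\mathbf{x})
  = \sum_{\mathbf{y}' \in \mathcal{Y}}
    \exp\bigl(\lambda \cdot \mathbf{F}(\mathbf{y}', \mathbf{x})\bigr)
  \tag{5}
```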

The probability of an arbitrary output sequence y is the inner product of the parameter vector and the features extracted from the cliques on the path that y represents in the output candidate graph, transformed by the exponential function and divided by the sum of the same quantity over all possible output sequences.
On the conditional random field, the output sequence ŷ with the highest probability for the input sequence x can be estimated by the following equation.
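The estimation formula is not reproduced in this text; later text refers to it as Equation (7). A reconstruction consistent with that later remark (that the normalization term can be dropped) is given below. The hat notation for the estimated sequence and the calligraphic Y for the candidate set are assumptions:

```latex
\hat{\mathbf{y}}
  = \operatorname*{arg\,max}_{\mathbf{y} \in \mathcal{Y}}
      p(\mathbf{y} \mid \mathbf{x}; \lambda)
  = \operatorname*{arg\,max}_{\mathbf{y} \in \mathcal{Y}}
      \lambda \cdot \mathbf{F}(\mathbf{y}, \mathbf{x})
  \tag{7}
```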

Here, Y denotes the set of output sequence candidates for the input sequence x. Because the normalization term Z_λ(x) in Equation (5) does not depend on y, it has no effect on the maximum-likelihood output (argmax) and need not be computed. In this way, Equation (7) yields the most probable output sequence when the input sequence x is given to the conditional random field.
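The source does not say how the maximizing path is found. One common choice for a lattice whose cliques are adjacent label pairs is the Viterbi algorithm, sketched below as an illustration only. The `score` callback, which is assumed to return λ·f for the clique (y_{i-1}, y_i) at position i, and the omission of a separate EOS clique are simplifications introduced here.

```python
def viterbi(n, labels, score):
    """Find the highest-scoring label path of length n.

    score(prev_label, label, i) is a hypothetical callback returning the
    weighted feature sum for the clique at position i; prev_label is
    "BOS" when i == 0.
    """
    # best[label] = score of the best partial path ending in label.
    best = {label: score("BOS", label, 0) for label in labels}
    back = []                                   # back-pointers per position
    for i in range(1, n):
        new_best, pointers = {}, {}
        for label in labels:
            prev = max(labels, key=lambda p: best[p] + score(p, label, i))
            new_best[label] = best[prev] + score(prev, label, i)
            pointers[label] = prev
        best = new_best
        back.append(pointers)
    last = max(labels, key=lambda label: best[label])
    path = [last]
    for pointers in reversed(back):             # follow back-pointers
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

The normalization term never appears, matching the observation that it does not affect the argmax; the cost is proportional to n times the square of the number of labels.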

In conditional random field learning, the model is learned by maximum likelihood estimation from learning data consisting of pairs (x, y*) of an input sequence and its output sequence.

Here, y* denotes the correct answer.

In general, for practical reasons, a probabilistic model is trained to maximize the logarithm of the probability (the log-likelihood). This transformation is valid because the logarithm is monotonically increasing, so the parameter value that attains the maximum is unchanged. This is generally called maximum likelihood learning, and learning on a conditional random field maximizes the following objective function.
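The objective function, referred to later as Equation (9), is not reproduced in this text. Since the text describes it as a sum of per-sample log-likelihoods, a reconstruction would be the following, where the number of training samples N and the sample index k are notation assumed here, and any regularization term the original may have contained is not shown:

```latex
\mathcal{L}(\lambda)
  = \sum_{k=1}^{N} \log p\bigl(\mathbf{y}^{*k} \mid \mathbf{x}^{k}; \lambda\bigr)
  \tag{9}
```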

The optimal solution of the objective function of Equation (9) is found by searching for the point at which its derivative with respect to the parameter λ is zero.
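The derivative itself is missing from this text. Differentiating a log-linear log-likelihood gives the observed features minus their expectation under the model, so a reconstruction (the equation number (10), the sample index k, and the count N are assumptions) is:

```latex
\frac{\partial \mathcal{L}(\lambda)}{\partial \lambda}
  = \sum_{k=1}^{N}
    \Bigl(
      \mathbf{F}\bigl(\mathbf{y}^{*k}, \mathbf{x}^{k}\bigr)
      - E_{p(\mathbf{y} \mid \mathbf{x}^{k}; \lambda)}
        \bigl[\mathbf{F}(\mathbf{y}, \mathbf{x}^{k})\bigr]
    \Bigr)
  = 0
  \tag{10}
```

The second term arises because the gradient of log Z_λ(x) with respect to λ equals the expected value of F under p(y | x; λ).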

Here, E[F(y, x)] denotes the expected value of the global feature F. The objective function of Equation (9) is concave in λ (equivalently, its negation is convex), so any local optimum is a global optimum. In practice, the optimization can be solved efficiently with general numerical optimization algorithms such as the steepest descent method or Newton's method.
Non-Patent Document 1: J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. of ICML-2001, pages 282-289, 2001.
Non-Patent Document 2: F. Sha and F. Pereira. Shallow Parsing with Conditional Random Fields. In Proc. of HLT/NAACL-2003, pages 213-220, 2003.
Non-Patent Document 3: Jun Suzuki, Erik McDermott, and Hideki Isozaki. Training Conditional Random Fields with Multivariate Evaluation Measures. In Association for Computational Linguistics (ACL), pages 217-224, 2006.
Non-Patent Document 4: Hideki Isozaki and Jun Suzuki. Learning Conditional Random Fields Based on Error Minimization: Application to Language Analysis. 12th Annual Meeting of the Association for Natural Language Processing, 2006.

As described above, a conditional random field is generally trained (that is, the parameter vector λ is estimated) by maximum likelihood learning. Non-Patent Document 3 and Non-Patent Document 4, however, propose a learning method that trains the conditional random field not by likelihood maximization but by maximizing (or minimizing) the evaluation measure of the task actually to be solved.

When a model trained as a conditional random field is applied to a real problem, its performance is almost always evaluated with a measure specific to that problem. Maximum likelihood learning, however, only maximizes the likelihood (the logarithm of the output probability) and does not guarantee that the performance on the target task is maximized after learning. The "evaluation measure maximization learning" proposed in Non-Patent Document 3 and Non-Patent Document 4 removes this mismatch between the objective function optimized during learning and the evaluation measure actually used to assess the task.

A concrete example follows. In sequence segmentation, performance is evaluated with a measure called the F value, computed per segment.
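The formula for the F value, referred to later as Equation (11), is not reproduced in this text. Using the quantities TP, FP, FN and the hyperparameter γ named in the following paragraph, the usual weighted F measure can be written as below. This is a reconstruction assuming the standard definition, which reduces to the harmonic mean of precision and recall when γ = 1:

```latex
F_{\gamma}
  = \frac{(\gamma^{2} + 1)\, TP}
         {\gamma^{2}\, FN + FP + (\gamma^{2} + 1)\, TP}
  \tag{11}
```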

In Equation (11), TP, FP, and FN denote the numbers of segments that are true positives (positive examples answered correctly), false positives (negative examples mistaken for positive examples), and false negatives (positive examples mistaken for negative examples), respectively. γ is a hyperparameter that determines the trade-off between recall and precision in computing the F value, and is set by hand to suit the target task. In general, the F value is computed with γ = 1.

The F value is an evaluation measure designed mainly for two kinds of output, positive and negative examples. "Positive" is the class of outputs to be evaluated, and "negative" is the class of all other outputs (those not evaluated). In sequence segmentation, positions given the label O in IOB labeling are treated as negative examples and all other labels as positive examples. Accordingly, TP and FN are each at most the number of positive segments, and FP is at most the total number of segments.
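As an illustration of how a segment-level F value could be computed from gold and predicted segmentations, the following sketch counts TP, FP, and FN over segments and applies the standard weighted F measure. The function names, the representation of a segment as an inclusive `(start, end)` pair, and the exact-match criterion for a true positive are assumptions made here.

```python
def segment_counts(gold, predicted):
    """Count TP, FP, FN over segments, treating a predicted segment as
    correct only if it matches a gold segment exactly (assumption)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def f_value(tp, fp, fn, gamma=1.0):
    """Weighted F measure; gamma trades off recall against precision."""
    g2 = gamma * gamma
    denominator = g2 * fn + fp + (g2 + 1.0) * tp
    # Return 0.0 when there are no segments at all (a convention chosen here).
    return (g2 + 1.0) * tp / denominator if denominator > 0 else 0.0
```

For example, with gold segments `[(0, 1), (3, 5)]` and predicted segments `[(0, 1), (3, 4)]`, one segment is a true positive, one a false positive, and one a false negative.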

As a comparison of Equation (11) with Equation (9) shows, the F value and the likelihood of a sequence are correlated to some extent, but they are not as closely related as, say, a proportional relationship would imply. Specifically, the right-hand side of Equation (9) is a (linear) sum of per-sample likelihoods, whereas the right-hand side of Equation (11) is not a simple sum but a (nonlinear) fraction that distinguishes positive and negative examples. Moreover, Equation (9) evaluates with the whole sequence as one unit, whereas Equation (11) evaluates with the segment as one unit.
Thus, in a sequence segmentation problem, learning by likelihood maximization does not necessarily maximize performance on the problem to be solved. Non-Patent Document 3 and Non-Patent Document 4 show that learning that maximizes the evaluation measure of the target problem improves performance further than ordinary maximum likelihood learning.

With this framework, any evaluation function of the target real problem can be introduced as the objective function for conditional random field learning, provided it is computed from the number of correct answers (or errors). Since the evaluation measures commonly used for real problems are basically computed from the number of correct answers (or errors), this framework makes it possible to train a conditional random field with most of the measures in current use.

In evaluation measure maximization learning, an optimization method for training must be devised for each evaluation measure to be handled. When designing the optimization method (including the algorithm) for a given measure, the (minimum) evaluation unit used by that measure must be defined first. Since sequence labeling and segmentation are the targets here, there are three main candidate evaluation units: (1) the whole sequence, (2) the segment, and (3) the single label.

As its basic device, evaluation measure maximization learning uses, for each evaluation unit, the difference between the correct output y* and the output candidates y other than the correct one (y ≠ y*). Learning then proceeds so as to maximize this per-unit difference in terms of the target evaluation measure. In other words, enlarging the difference between the correct output and the most confusable output candidate, as weighted by the evaluation measure, ultimately maximizes that measure.

In a conditional random field, the decision function g() of an output y taken as a whole sequence is given by the inner product of the feature vector and the parameter vector, as follows.
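The formula for the decision function is missing from this text. From the description (an inner product of the feature vector and the parameter vector, with a sum over individual cliques on the right-hand side), a reconstruction is given below. The equation number (12), the argument order of g, and the index range i = 1, ..., n + 1 are assumptions:

```latex
g(\mathbf{y}, \mathbf{x}, \lambda)
  = \lambda \cdot \mathbf{F}(\mathbf{y}, \mathbf{x})
  = \sum_{i=1}^{n+1} \lambda \cdot \mathbf{f}(\mathbf{y}, \mathbf{x}, i)
  \tag{12}
```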

The decision function represents the likelihood (or score) that a given output y is produced when the input x and the parameter vector λ are given. The output y here corresponds to the case in which the evaluation unit is the whole sequence, and the sum over i on the right-hand side is the sum of the feature vectors of the individual cliques. The error estimation function can then be defined with the decision function as follows.
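The definition of the error estimation function, referred to later as Equation (13), is not reproduced in this text. The description (a difference between the decision function of the correct output and that of the most confusable other candidate) is consistent with the form used in Non-Patent Document 3. The sign convention below, under which a negative value means the correct output scores highest, is an assumption, as is the calligraphic Y for the candidate set:

```latex
d(\mathbf{y}^{*}, \mathbf{x}, \lambda)
  = -\,g(\mathbf{y}^{*}, \mathbf{x}, \lambda)
    + \max_{\mathbf{y} \in \mathcal{Y} \setminus \{\mathbf{y}^{*}\}}
      g(\mathbf{y}, \mathbf{x}, \lambda)
  \tag{13}
```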

The error estimation function outputs an estimate of how easily the correct answer attached to the learning data is confused with outputs other than the correct answer. Here it is expressed as the difference between the correct answer and an output candidate other than the correct answer. That is, the larger the gap between the decision function values of the correct answer and of the most confusable output candidate, the less likely an error is.

Ultimately, an objective function based on the evaluation index of interest is constructed using this error estimation function d(). That is, the evaluation index that is computed at evaluation time is, at learning time, computed from the values returned by the error estimation function. However, when Equation (13) is used, the error estimation function ranges over (−∞, ∞), whereas the actual evaluation index may need values only in [0, 1]. For example, when evaluating by accuracy rate, the evaluation index is built from the number of errors, so one wants to count a correct answer as 1 and an error as 0. For this reason, a smoothing function is generally used to convert the range. The sigmoid function is one example. FIG. 9 shows examples of smoothing functions: a step function, a sigmoid function, and a logistic function.
However, the smoothing function is only one means of avoiding problems such as values that cannot be computed; it is not an essential technique or condition of evaluation index maximization learning.
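The range conversion can be sketched as follows. This is an illustration under assumptions: the slope parameter `a` and the choice to count an error (positive error estimate) as 1 are not specified in the text, and FIG. 9 itself is not reproduced here.

```python
import math


def step(d):
    """Hard 0/1 count: 1 when the error estimate is positive (a mistake), otherwise 0."""
    return 1.0 if d > 0 else 0.0


def sigmoid(d, a=1.0):
    """Smooth, differentiable stand-in for step(); a is an assumed slope parameter."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-a * d))
    e = math.exp(a * d)  # computed this way so large negative d cannot overflow
    return e / (1.0 + e)
```

Unlike the step function, the sigmoid has a non-zero gradient everywhere, which is what makes gradient-based optimization of the resulting objective possible.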

When the evaluation index is defined over the entire sequence, an error estimation function d() such as Equation (13) can be used. For segment-level evaluation, an error estimation function d_{[i,j]}() for the interval from position i to position j can be defined in the same way as for the entire sequence.
FIG. 10 illustrates the error estimation calculation method described in Non-Patent Documents 3 and 4. In the method of FIG. 10, the difference between the correct path and the most likely path other than the correct path is computed, so that the segment-level error estimation function is computed approximately using the error estimation function for the entire sequence.
Here, the most likely path other than the correct path is selected from among the paths that do not coincide with the correct path between y_2 and y_4 of the target segment.

The segment error estimation function has been built from the error estimation function for the entire sequence because conditional random fields originally provide a framework for likelihood maximization learning over the entire sequence, so efficient methods for computing the sequence-level decision function g() and error estimation function d() were already established. Conversely, for sequence labeling and segmentation in the evaluation index maximization learning framework, no method or algorithm for computing the segment-level error estimation function exactly had been devised, so an approximate method was used. That is, Non-Patent Documents 3 and 4 computed the segment-level error estimation function approximately, using almost the same algorithm as that used in training conditional random fields.

On the other hand, when the error estimation function for the entire sequence is used as an approximation, as in the prior art, there is no guarantee that the segment-level evaluation index that one actually wants to optimize is truly maximized.
This is explained with reference to FIG. 11, which illustrates the problem of the prior art. In the target interval [i = 2, j = 2] of the output candidate graph shown in FIG. 11, using path-level values outputs node C (path c-l-o: 3.8), whereas using segment-level values outputs node B, whose segment likelihood is 8.8.

This inconsistency arises because, viewed at the segment level, the output probability of a segment and the segment output probability derived from the sequence output probability are correlated but not identical. In other words, strictly speaking, the evaluation index that is the real objective may not be maximized.
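The mismatch can be reproduced on a toy example. The path weights below are hypothetical; they are chosen only so that the totals echo the 3.8 and 8.8 quoted for FIG. 11 and are not the actual contents of that figure.

```python
from collections import defaultdict

# Hypothetical weights of complete paths; the middle position is the target interval.
PATH_WEIGHTS = {
    ("a", "C", "o"): 3.8,
    ("a", "B", "o"): 3.0,
    ("b", "B", "o"): 3.0,
    ("b", "B", "p"): 2.8,
}


def best_path_label(paths, pos):
    """Label at pos on the single highest-weight path (path-level view)."""
    best = max(paths, key=paths.get)
    return best[pos]


def segment_totals(paths, pos):
    """Total weight of all paths through each label at pos (segment-level view)."""
    totals = defaultdict(float)
    for path, weight in paths.items():
        totals[path[pos]] += weight
    return dict(totals)
```

Here the best single path goes through node C, while node B collects more total weight because several moderately good paths share it, so the two views disagree about which node should be output.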

To solve the above problem, an object of the present invention is to provide a means that, when estimating the parameter vector, determines the error estimation function of the evaluation unit (segment, label, or entire sequence) exactly, and that can be realized with the same amount of computation as the approximate calculation method used in the conventional method.

The language analysis model learning device according to the present invention, made to solve the above problem, calculates a parameter vector used for assigning predetermined classification tags to text data from learning data, which is text data to which classification tags have been assigned in advance. The device comprises: a storage unit that stores an output candidate graph, in which output label candidates indicating the types of classification tags are associated with the learning data, and an initial value of the parameter vector; and a parameter learning unit that, using a decision function expressing the appearance probability of a predetermined evaluation unit of the learning data as a marginal probability with the parameter vector as a variable, calculates as an error estimation function the difference between the appearance probability of the correct element of the learning data and the appearance probability of the most likely element other than the correct element, and calculates the parameter vector that optimizes an objective function set using the error estimation function.
With the language analysis model learning device according to the present invention, evaluation index maximization learning can be performed based on the exact appearance probability of the elements of the evaluation units of the learning data.

Other aspects of the present invention are described in detail in the best mode described later.

The present invention improves on the prior art in several respects.
Because evaluation index optimization learning can be performed on the basis of exact appearance probabilities, the best learning result in the sense of the evaluation index to be optimized can be obtained. The prior art presented a methodology for replacing the objective function with an evaluation index, but when actually obtaining the output probability of an evaluation unit it reused an existing calculation algorithm and relied on approximate values. As a framework it was a learning method that maximizes an evaluation index, but in practice it may not have maximized that index. The present invention remedies this problem and makes it possible to optimize the evaluation index on the basis of exact output probabilities.

In addition, because the output probability of an evaluation unit can be obtained, the output value can be treated as a probability when the output is processed further. For example, when named entity extraction using the present invention feeds information retrieval, the output value can serve as an indicator of how reliable each extracted named entity is, and if the information retrieval system is based on probability theory, a named entity extraction system using the present invention can be coupled to it directly. In the conventional method the output value may not satisfy the definition of a probability, so it could not be used as is in such situations.

The method of the present invention is described below, followed by the best mode for carrying out the invention to which the method is applied (hereinafter, the embodiment).

(Outline of the language model learning method)
FIG. 12 illustrates the output probability of a segment.
In the output candidate graph shown in FIG. 12, the output probability of the (partial) output sequence y_{[i,j]} of the segment from position i to position j in the output sequence y can be defined using marginal probabilities as follows.

In Equation (15), the term [expression omitted in the source text] represents the sum of the weights of all paths that pass through the path within the segment [i, j]. α_{[0,i]}(y_i) and β_{[j,n]}(y_j) are the forward probability and the backward probability, respectively, and represent the marginal probabilities for [i, j]. In the example shown in FIG. 12, α_{[0,i]}(y_i) represents the probability of segment [0, 2] and β_{[j,n]}(y_j) represents the probability of segment [4, 6]. The forward probability is defined by Equation (16) below.

Here α_{[0,0]}(y_0) = 1 when i = 0, and Y denotes the set of all labels that a single position can take. The backward probability is defined similarly by Equation (17) below.

Here β_{[n,n]}(y_n) = 1 when j = n.
Because the denominator of Equation (15) does not depend on y, the segment-level decision function based on the segment-level appearance probability can finally be expressed as follows.

Using this decision function based on the segment appearance probability, the error estimation function d() can be expressed as follows.

FIG. 13 illustrates the meaning of Equation (19) on the output candidate graph shown in FIG. 12. As shown in FIG. 13, the term −g_{[i,j]}(y^*, x, λ) in Equation (19) represents the appearance probability of the correct segment, and the remaining term represents the appearance probability of the most likely segment other than the correct one. Here, the most likely segment other than the correct one is selected from among the segments in the target interval [2, 4] that do not coincide with the correct segment.

Using marginal probabilities makes it possible to obtain exactly the probability that the target segment is selected from the sequence. As a result, learning with the error estimation function uses the difference between the probability that the correct segment is output and the probability that another segment is output.
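The equations of this passage are not reproduced in the text. One form consistent with the stated boundary conditions, the forward and backward probabilities, and the sign of the correct-segment term is sketched below. It is an assumption rather than the patent's exact notation: the link feature function f, the segment weight w, the normalizer Z, and the use of a logarithm in g are all introduced here for illustration.

```latex
\begin{align*}
w(y_{[i,j]}) &= \exp\Big(\lambda \cdot \sum_{k=i+1}^{j} f(y_{k-1}, y_k, x)\Big) \\
P(y_{[i,j]} \mid x, \lambda) &=
  \frac{\alpha_{[0,i]}(y_i)\; w(y_{[i,j]})\; \beta_{[j,n]}(y_j)}{Z(x,\lambda)} \\
\alpha_{[0,i]}(y_i) &= \sum_{y_{i-1} \in Y} \alpha_{[0,i-1]}(y_{i-1})
  \exp\big(\lambda \cdot f(y_{i-1}, y_i, x)\big),
  \qquad \alpha_{[0,0]}(y_0) = 1 \\
\beta_{[j,n]}(y_j) &= \sum_{y_{j+1} \in Y}
  \exp\big(\lambda \cdot f(y_j, y_{j+1}, x)\big)\, \beta_{[j+1,n]}(y_{j+1}),
  \qquad \beta_{[n,n]}(y_n) = 1 \\
g_{[i,j]}(y, x, \lambda) &= \log\big(\alpha_{[0,i]}(y_i)\; w(y_{[i,j]})\; \beta_{[j,n]}(y_j)\big) \\
d_{[i,j]}(y^*, x, \lambda) &= -g_{[i,j]}(y^*, x, \lambda)
  + \max_{y_{[i,j]} \neq y^*_{[i,j]}} g_{[i,j]}(y, x, \lambda)
\end{align*}
```

Because Z(x, λ) is the same for every candidate segment, dropping it from g changes neither the ranking of candidates nor the value of d.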

Comparing Equation (13) of the prior art with Equation (19) as error estimation functions, the prior art's segment-level decision function, which was approximated by a value for the entire sequence, is on the surface simply replaced with a value computed exactly at the segment level. Comparing Equation (12) of the prior art with Equation (18) as decision functions, however, shows the real difference: the prior art uses the weight of a sequence, whereas the present invention uses a weight based on the segment appearance probability obtained from marginal probabilities.

As for the actual calculation, α_{0,k}(y_k) and β_{l,n}(y_l) can be computed efficiently with the forward-backward algorithm. In the prior art, the Viterbi algorithm was run in two passes, one from the front and one from the back, to approximate the decision function and error estimation function of the target segment from the output probability of the entire sequence; the present invention instead uses the forward-backward algorithm to compute the segment-level decision function and error estimation function. Because the two-pass Viterbi algorithm of the prior art and the forward-backward algorithm of the present invention have exactly the same computational cost, the present invention enables optimization based on exact segment-level output probabilities with the same amount of computation as the prior art.
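A minimal sketch of the forward-backward computation of a segment probability follows, checked against brute-force enumeration. It assumes a simple chain in which `psi[k][a][b]` is a non-negative link weight between adjacent positions; this table, the integer label indices, and the function names are illustrative assumptions rather than the patent's data structures.

```python
from itertools import product


def forward_backward(psi):
    """psi[k][a][b] is the weight of the link from label a at position k
    to label b at position k + 1. Returns the forward and backward tables."""
    n, num_labels = len(psi), len(psi[0])
    labels = range(num_labels)
    alpha = [[1.0] * num_labels]  # forward value at position 0 is 1
    for k in range(n):
        alpha.append([sum(alpha[k][a] * psi[k][a][b] for a in labels) for b in labels])
    beta = [[1.0] * num_labels]   # backward value at position n is 1
    for k in range(n - 1, -1, -1):
        beta.insert(0, [sum(psi[k][a][b] * beta[0][b] for b in labels) for a in labels])
    return alpha, beta


def segment_probability(psi, segment, i):
    """Probability that positions i .. i + len(segment) - 1 carry the labels in segment."""
    alpha, beta = forward_backward(psi)
    j = i + len(segment) - 1
    inside = 1.0
    for k in range(i, j):
        inside *= psi[k][segment[k - i]][segment[k - i + 1]]
    return alpha[i][segment[0]] * inside * beta[j][segment[-1]] / sum(alpha[-1])


def segment_probability_brute_force(psi, segment, i):
    """Same quantity by enumerating every path; exponential cost, for checking only."""
    n, num_labels = len(psi), len(psi[0])
    total = matched = 0.0
    for path in product(range(num_labels), repeat=n + 1):
        weight = 1.0
        for k in range(n):
            weight *= psi[k][path[k]][path[k + 1]]
        total += weight
        if path[i:i + len(segment)] == tuple(segment):
            matched += weight
    return matched / total
```

The two tables are filled in one sweep each over the lattice, the same sweep structure as a forward and a backward Viterbi pass with the maximum replaced by a sum.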

(Text analysis device)
Next, a text analysis device according to an embodiment of the present invention to which the above method is applied is described. FIG. 14 is a block diagram of the text analysis device of this embodiment. As shown in FIG. 14, the text analysis device 1 is configured by interconnecting a CPU (Central Processing Unit) 11, which is an arithmetic unit; a RAM (Random Access Memory) 12, into which the programs described later are loaded and in which temporarily stored data is held; an input/output unit 13, which is the interface with the outside; and a storage 14 consisting of a hard disk drive or the like. It is implemented using a personal computer or the like.

The storage 14 of the text analysis device 1 stores programs that, when executed by the CPU 11, operate as a learning device 141, an estimator 142, and a tag/segment converter 143, together with the data these programs use or generate: learning data 144, a parameter vector 145, raw text 146, tagged text 147, and segmented tagged text 148.
The operation of each program and the content of each data item are described later. The programs and data stored in the storage 14 can also be stored on various computer-readable recording media (CD-ROM and the like).

(Outline of the operation of the text analysis device)
FIG. 15 is an explanatory diagram outlining the processing operation of the text analysis device 1 configured as above. The processing operation of the text analysis device 1 is described with reference to FIG. 15.

The operation of the text analysis device 1 is broadly divided into a learning phase, in which the parameter vector is calculated from learning data that is a tagged corpus, and an evaluation phase, in which the raw text is tagged or chunked using the calculated parameter vector.

As shown in FIG. 15, in the learning phase the learning device 141 receives the learning data 144, which is a tagged corpus, and outputs the parameter vector 145.
In the evaluation phase, the estimator 142 uses this parameter vector 145 to tag the raw text 146 and outputs the tagged text 147. The tag/segment converter 143 then chunks the tagged text 147 output by the estimator 142 and outputs the segmented tagged text 148. For reference, FIG. 16 shows an example of the learning data 144 input to the learning device 141 and of the parameter vector 145 it outputs, and FIG. 17 shows an example of the raw text 146 input to the estimator 142 and of the tagged text 147 it outputs.

The processing in the estimator 142 is the same as the processing in the learning device 141 except that the raw text 146 is input in place of the learning data 144 and the parameter vector 145 calculated by the learning device 141 is input in place of the initial value of the parameter vector 145, so its detailed description is omitted. The processing in the tag/segment converter 143 can use a known algorithm, as noted above, so its detailed description is likewise omitted.
In the operation example shown in FIG. 15, the estimator 142 performs tagging and outputs the tagged text 147. Alternatively, the estimator 142 may perform chunking and output segmented text, and the tag/segment converter 143 may then perform tagging and output the segmented tagged text 148.
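One common way to turn per-token tags into segments is sketched below. It assumes the B-/I-/O tagging convention that appears in the specific example later in this description; the text does not say which known algorithm the tag/segment converter 143 uses, and the English label names are hypothetical.

```python
def bio_to_segments(tags):
    """Group B-/I-/O tags into (label, start, end_exclusive) segments."""
    segments = []
    start, label = None, None
    for k, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last segment
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            segments.append((label, start, k))
            start, label = None, None
        # A "B-" tag always opens a segment; a stray "I-" tag is treated leniently as an opener.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = k, tag[2:]
    return segments
```

For the eight-token example sentence used later, the tags "B-person, I-person, O, B-organization, I-organization, O, O, O" yield one person segment over tokens 0 to 1 and one organization segment over tokens 3 to 4.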

(Configuration and operation of the learning device)
Next, among the components of the text analysis device 1, the learning device 141 is described in detail. The learning device 141 corresponds to the language analysis model learning device in the claims, and the processing procedure of the learning device 141 corresponds to the language analysis model learning method.

FIG. 18 shows the detailed blocks of the learning device 141. As shown in FIG. 18, the learning device 141 includes an output candidate graph generation unit 1411, a feature extraction unit 1412, and a parameter learning unit 1413.
The output candidate graph generation unit 1411 generates an output candidate graph from the learning data 144 and the output label candidates input through the input/output unit 13. The feature extraction unit 1412 acquires the feature extraction template input through the input/output unit 13, extracts features for each clique on the output candidate graph generated by the output candidate graph generation unit 1411, calculates the number of dimensions of the parameter vector 145, and sets its initial value. The parameter learning unit 1413 learns the parameter vector 145 with a learning criterion algorithm, starting from the output candidate graph and the initial value of the parameter vector 145 set by the feature extraction unit 1412.

FIG. 19 is a flowchart of the processing procedure of the learning device 141. The procedure is described in detail with reference to FIG. 19. In the following steps, information generated or calculated by each component is assumed to be temporarily stored in the RAM, and this is not stated each time.

First, in preparation for running the learning device 141, the evaluation index used for learning is chosen (usually the evaluation index of the target problem), and the (minimum) unit of evaluation is determined from that index. The objective function used for learning is then determined from the chosen evaluation index and evaluation unit.

The learning device 141 then starts operating. The output candidate graph generation unit 1411 of the learning device 141 acquires the learning data 144 from the storage 14 (step S101) and stores it in the RAM 12.

Next, the output candidate graph generation unit 1411 acquires, through the input/output unit 13, the set of output label candidates, which depends on the target problem (step S102), and stores it in the RAM 12. It then generates the output candidate graph from the learning data and the output label candidate set stored in the RAM 12 (step S103).

Next, the output candidate graph generation unit 1411 hands processing over to the feature extraction unit 1412, which acquires the feature extraction template through the input/output unit 13 (step S104) and stores it in the RAM 12. It then extracts a feature vector for each clique on the output candidate graph generated by the output candidate graph generation unit 1411 and calculates the number of dimensions of the parameter vector 145 (step S105). The feature extraction unit 1412 also sets the initial value of the parameter vector 145, for example by setting every element of the parameter vector 145 to 0.

Next, the feature extraction unit 1412 hands processing over to the parameter learning unit 1413. Starting from the output candidate graph generated by the output candidate graph generation unit 1411 and the initial value of the parameter vector 145, which has the number of dimensions calculated by the feature extraction unit 1412, the parameter learning unit 1413 learns the parameter vector 145 with the learning criterion algorithm (step S106) and stores the learning result in the storage 14.

(Details of processing in the parameter learning unit)
FIG. 20 is a flowchart of the detailed processing procedure of the parameter learning unit 1413. The processing of the parameter learning unit 1413 in step S106 is described in more detail with reference to FIG. 20.

When processing reaches step S106 of the flowchart shown in FIG. 19, the parameter learning unit 1413 uses the current parameter vector 145 to calculate the score (weight) of the correct output for each evaluation unit (step S201). It then uses the current parameter vector 145 to calculate, for each evaluation unit, the score of the output candidate other than the correct answer (the maximum likelihood output candidate) (step S202).

Next, for each evaluation unit, the parameter learning unit 1413 calculates the difference between the correct output score calculated in step S201 and the maximum likelihood output candidate score calculated in step S202 (step S203). It then calculates the objective function L based on the evaluation index determined in advance (step S204).

Next, the parameter learning unit 1413 calculates the gradient ∇L_λ of the objective function L calculated in step S204 (step S205) and determines whether the objective function L has converged (step S206). This can be judged, for example, by whether the gradient ∇L_λ is at or below a predetermined value. If the objective function L has not converged ('No' in step S206), the scores are updated using the value of the gradient ∇L_λ and processing returns to step S201. If the objective function L has converged ('Yes' in step S206), the parameter vector 145 at that point is stored in the storage 14 and processing ends (returning to step S106 of FIG. 19).
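The loop of steps S201 to S206 can be sketched as follows. Plain gradient descent with a fixed step size is an assumption made for illustration; the text only states that the gradient is used for the update and that convergence may be judged from the gradient. The callable `objective_and_gradient`, the step size, and the tolerance are hypothetical.

```python
def learn(lam, objective_and_gradient, step_size=0.1, tol=1e-6, max_iter=1000):
    """Repeat scoring and update until the gradient is small enough.

    objective_and_gradient(lam) stands in for steps S201 to S205: it returns
    the objective value L and its gradient with respect to lam.
    """
    for _ in range(max_iter):
        _loss, grad = objective_and_gradient(lam)
        if max(abs(gk) for gk in grad) <= tol:  # step S206: converged
            break
        # Not converged: move against the gradient, then score again.
        lam = [lk - step_size * gk for lk, gk in zip(lam, grad)]
    return lam
```

In practice the callable would compute the per-unit error estimates, pass them through the chosen smoothing function, and sum them; any gradient-based optimizer could take the place of the fixed-step update.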

Next, the operation of the learning device 141 (the learning phase) is described in more detail using two specific examples of sequence segmentation (see FIGS. 18 to 20 as appropriate).

(Specific example 1: sequence segmentation with accuracy rate as the evaluation index)
Specific example 1 is a sequence segmentation problem in which the evaluation unit is the segment and the evaluation index is the segment-level accuracy rate. Choosing the evaluation unit and evaluation index amounts to choosing the evaluation index (learning criterion) applied in step S204 of FIG. 20, and is done in advance as preparation for the processing of the learning device 141.
With the evaluation unit set to the segment and the evaluation index set to the segment-level accuracy rate as above, the objective function for the sequence segmentation problem is as follows.

However, the accuracy rate is the complement of the error rate (it falls as the error rate rises), and the error estimation function of Equation (19) is defined to take a large negative value when the score of the correct answer is large (because the machine learning field generally formulates learning as error minimization), so in practice the error rate is minimized.

The objective function is therefore defined as follows.

The total number of segments is a constant, so it is not included in the objective function.
Then, following the flowchart shown in FIG. 19, the output candidate graph generation unit 1411 of the learning device 141 acquires the learning data (tagged corpus) (step S101). At the same time, it acquires the output label candidate set (step S102).

FIG. 21 shows examples of the learning data and other inputs acquired by the output candidate graph generation unit 1411. FIG. 21(a) shows an example of learning data: a text such as 「田中・一郎・は・陸上・連盟・の・会長・です」 ("Ichiro Tanaka is the president of the athletics federation", split into eight morphemes) with labels attached is input; morpheme boundaries are assumed to be given in advance. The output label candidates in this case are, as shown in FIG. 21(b), the five labels "B-person name, I-person name, B-organization name, I-organization name, O", which form the output label candidate set.

Next, the output candidate graph generation unit 1411 generates the output candidate graph from the acquired learning data (step S103). FIG. 21(c) is an example of the output candidate graph that the output candidate graph generation unit 1411 generated from the learning data of FIG. 21(a) and the output label set of FIG. 21(b). The output candidate graph of FIG. 21(c) takes the form of a lattice in which all possible output candidates are connected by paths. That is, each path from the BOS node to the EOS node in the output candidate graph corresponds to one output, and the graph contains every possible output candidate.
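A lattice of this kind can be sketched as follows. The column-list representation and the function names are assumptions for illustration; the patent does not describe how the graph is stored.

```python
def build_lattice(tokens, labels):
    """Return the node columns and the links of a fully connected label lattice.

    Each token position gets one node per label; BOS and EOS close the two ends.
    """
    columns = [["BOS"]]
    columns += [[(k, label) for label in labels] for k in range(len(tokens))]
    columns += [["EOS"]]
    edges = [(a, b)
             for left, right in zip(columns, columns[1:])
             for a in left
             for b in right]
    return columns, edges


def count_paths(columns):
    """Number of BOS-to-EOS paths, i.e. the number of distinct outputs."""
    total = 1
    for column in columns:
        total *= len(column)
    return total
```

For the eight-morpheme example with five labels, the lattice has 185 links but 5^8 = 390,625 paths, which is why per-path enumeration is avoided in favor of computations that sweep the lattice once.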

Further, a feature vector is assigned to each node (or link) of the output candidate graph using the output candidate graph and a feature extraction template.
The feature extraction template has the format shown in FIG. 6; here, the feature vector of the node at the target position is created from the features of the two preceding and two following words. Because each feature is generated from the combination of these surrounding words and the output label to which the node belongs, the feature vectors of nodes at the same position with different output labels are orthogonal to each other (their inner product is 0).
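The orthogonality property can be checked with a small sketch. The template below is an assumption for illustration (a window of two words on each side plus the current word, each conjoined with the node's label); the actual template of FIG. 6 is not reproduced in this text, and `node_features` and `dot` are hypothetical names.

```python
# Sketch: label-conjoined window features stored as a sparse dict.
tokens = ["田中", "一郎", "は", "陸上", "連盟", "の", "会長", "です"]

def node_features(tokens, pos, label, window=2):
    """Sparse feature vector for node (pos, label); pos is 0-based."""
    feats = {}
    for offset in range(-window, window + 1):
        j = pos + offset
        word = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        # The label is part of every feature key, so different labels never share a key.
        feats[(offset, word, label)] = 1.0
    return feats

def dot(u, v):
    """Inner product of two sparse vectors."""
    return sum(value * v.get(key, 0.0) for key, value in u.items())

f_org = node_features(tokens, 3, "B-organization")
f_out = node_features(tokens, 3, "O")
print(dot(f_org, f_out), dot(f_org, f_org))
```

Because the label is embedded in every feature key, two nodes at the same position with different labels have no active feature in common, so their inner product is 0.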

As a specific example, feature vectors such as those shown in FIG. 22 are assigned to the nodes of the output candidate graph shown in FIG. 21(c).
After this preprocessing, the output candidate graph with feature vectors and the initialized parameter vector (a vector whose elements are all 0) are input to the parameter learning unit 1413, and learning starts.

Next, the operation of the parameter learning unit 1413 is described (see FIG. 20). As an example, take the text shown in FIG. 23, 「田中一郎は陸上連盟の会長です」 ("Ichiro Tanaka is the president of the athletics federation"), with the correct output "B-person name, I-person name, O, B-organization name, I-organization name, O, O, O", and consider evaluating the interval [4, 5].
The correct answer in this interval is "B-organization name, I-organization name", so the scores of all paths passing through these two nodes are used to obtain a decision function based on the marginal probability that "B-organization name, I-organization name" is output in the interval [4, 5]. This corresponds to the process of calculating the score of the correct output (step S201).

Formula (19) is used for this calculation. For the forward probability α_{[0,i]}(y_i), i = 4 and y_4 = B-organization name. That is, it represents the score of all the preceding paths that connect to the B-organization name node at position 4. It is obtained with the forward algorithm.
Similarly, for the backward probability β_{[j,n]}(y_j), j = 5 and y_5 = I-organization name; it is the score of the following paths that connect to the I-organization name node at position 5. It is obtained with the backward algorithm.
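The following sketch illustrates how a segment marginal can be assembled from forward and backward scores and checks it against brute-force enumeration. It is written under assumptions: formula (19) is not reproduced in this text, so the factorization into node potentials `psi` and transition potentials `trans`, the 0-based positions, and all function names are hypothetical.

```python
from itertools import product

def forward(psi, trans):
    """alpha[i][y]: summed score of all partial paths ending at node (i, y)."""
    n, k = len(psi), len(psi[0])
    alpha = [psi[0][:]]
    for i in range(1, n):
        alpha.append([psi[i][y] * sum(alpha[i - 1][a] * trans[a][y] for a in range(k))
                      for y in range(k)])
    return alpha

def backward(psi, trans):
    """beta[i][y]: summed score of all partial paths leaving node (i, y)."""
    n, k = len(psi), len(psi[0])
    beta = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = [sum(trans[y][b] * psi[i + 1][b] * beta[i + 1][b] for b in range(k))
                   for y in range(k)]
    return beta

def segment_marginal(psi, trans, start, seg):
    """Probability that positions start..start+len(seg)-1 carry the labels in seg."""
    alpha, beta = forward(psi, trans), backward(psi, trans)
    z = sum(alpha[-1])                      # normalizer: score of all paths
    inside = 1.0                            # score of the path inside the segment
    for t in range(1, len(seg)):
        inside *= trans[seg[t - 1]][seg[t]] * psi[start + t][seg[t]]
    end = start + len(seg) - 1
    return alpha[start][seg[0]] * inside * beta[end][seg[-1]] / z

def brute_force_marginal(psi, trans, start, seg):
    """Same quantity by enumerating every path (only feasible for tiny lattices)."""
    n, k = len(psi), len(psi[0])
    num = z = 0.0
    for path in product(range(k), repeat=n):
        score = psi[0][path[0]]
        for i in range(1, n):
            score *= trans[path[i - 1]][path[i]] * psi[i][path[i]]
        z += score
        if list(path[start:start + len(seg)]) == list(seg):
            num += score
    return num / z

# Toy lattice: 4 positions, 3 labels, arbitrary positive potentials.
n, k = 4, 3
psi = [[1.0 + 0.3 * ((3 * i + y) % 4) for y in range(k)] for i in range(n)]
trans = [[1.0 + 0.2 * ((a + 2 * b) % 3) for b in range(k)] for a in range(k)]
p_fb = segment_marginal(psi, trans, 1, [2, 0])
p_bf = brute_force_marginal(psi, trans, 1, [2, 0])
print(p_fb, p_bf)
```

The dynamic-programming version visits each lattice link a constant number of times, whereas the brute-force check grows exponentially with the number of positions.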

Next, the highest-scoring segment other than the correct segment is obtained. This corresponds to the process of calculating the score of the maximum likelihood output candidate (step S202).
The calculation method is the same as for the correct segment: scores are computed for every possible segment in the same interval other than the correct one, and the segment with the maximum value is selected.
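Selecting the strongest competitor can be sketched as a maximum over all label sequences for the interval, excluding the correct one. The score table below is invented purely for illustration, and `best_competitor` is a hypothetical helper; in practice each score would come from the segment marginal computation.

```python
from itertools import product

labels = ["B-person", "I-person", "B-organization", "I-organization", "O"]

def best_competitor(score, labels, length, correct):
    """Highest-scoring label sequence of the given length other than `correct`."""
    candidates = (c for c in product(labels, repeat=length) if c != tuple(correct))
    return max(candidates, key=score)

# Invented toy scores; unlisted candidates score 0.
toy_scores = {
    ("B-organization", "I-organization"): 0.60,
    ("O", "O"): 0.25,
    ("B-person", "I-person"): 0.10,
}
correct = ("B-organization", "I-organization")
best = best_competitor(lambda seg: toy_scores.get(seg, 0.0), labels, 2, correct)
print(best)
```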

Formula (19) is then calculated using the scores of the correct segment and of the maximum likelihood output candidate segment obtained here. This corresponds to the process of calculating the difference (step S203).
This completes the error estimation for the interval [4, 5]. Since the segment-level accuracy is used here, the logistic function is considered as the smoothing function (see FIG. 9). The value of the error estimation function then lies in the range [0, ∞], and the larger the value of the decision function for the correct segment, the closer the returned value is to 0.
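A possible reading of this step is sketched below. It assumes the difference is taken as "best competitor minus correct" and that the logistic smoothing is log(1 + exp(d)); the definition in FIG. 9 is not reproduced in this text, so both choices and both function names are assumptions. They are consistent with the stated behavior: non-negative values that approach 0 as the correct segment's decision value grows.

```python
import math

def error_estimate(g_correct, g_best_other):
    """Difference between the best competitor's and the correct segment's decision values (assumed sign)."""
    return g_best_other - g_correct

def logistic_loss(d):
    """Numerically stable log(1 + exp(d)) (assumed form of the logistic smoothing)."""
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))

for g_correct in (0.0, 2.0, 10.0):
    print(g_correct, logistic_loss(error_estimate(g_correct, 1.0)))
```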

Since the segment error rate of formula (22) is simply the sum of the segment-level error estimation functions, the above processing is performed for each segment. This corresponds to the process of calculating the objective function L based on the evaluation index (step S204).

What is ultimately sought is the parameter vector λ that minimizes the objective function, that is, the parameter vector at which the derivative becomes 0. To update the parameters, the gradient of the objective function (formula (22)) is obtained. This corresponds to the process of calculating the gradient of the objective function (step S205).
The derivative of formula (22) can be decomposed as follows using the chain rule.
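The equation itself did not survive in this text. A generic form of such a chain-rule decomposition, written as a sketch with assumed notation (s ranges over the evaluated segments, d_s is the error estimate for segment s, and l_s = l(d_s) is its smoothed value), would be:

```latex
\frac{\partial L}{\partial \lambda}
  = \sum_{s} \frac{\partial L}{\partial l_s}\,
             \frac{\partial l_s}{\partial d_s}\,
             \frac{\partial d_s}{\partial \lambda}
```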

The derivative of the objective function with respect to the smoothing function l is as follows, according to the definition of the logistic function.

Next, the partial derivative of the error estimation function d() with respect to the parameters is shown.

The partial derivative of the error estimation function d() with respect to the parameters turns out to be the expected value with which each node (link) on the paths passing through the target segment is output. The gradient of each parameter can therefore be computed efficiently using the Forward-Backward algorithm.
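As a sketch of why expectations appear, assume a log-linear model P(y | x; λ) ∝ exp(λ · F(y, x)) with global feature vector F, and assume the decision function is the logarithm of the segment marginal P(s | x; λ), the total probability of all paths through segment s. These are assumptions, since the patent's own equation is not reproduced in this text. Under them, the standard identity is:

```latex
\frac{\partial}{\partial \lambda} \log P(s \mid x; \lambda)
  = \mathbb{E}_{P(y \mid x, s; \lambda)}\bigl[F(y, x)\bigr]
  - \mathbb{E}_{P(y \mid x; \lambda)}\bigl[F(y, x)\bigr]
```

Both expectations are sums of node and link marginals weighted by their feature vectors, which is what the Forward-Backward algorithm provides.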

Based on the gradient calculated here, the parameters are updated using a commonly used numerical optimization method. If the parameter update has converged ('Yes' in step S206), learning ends, the parameter vector 145 is output to the storage 14 (step S207), and the process terminates.
If it has not converged ('No' in step S206), the process returns to the start of parameter learning. Convergence is judged, for example, by whether the amount of change in the parameter vector is sufficiently small (at or below a predetermined threshold).
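The update-and-check loop can be sketched as follows. Plain gradient descent is used here only as a stand-in for "a commonly used numerical optimization method"; the patent does not name a specific optimizer in this passage, and the function `minimize`, the step size, and the toy quadratic objective are all assumptions for illustration.

```python
def minimize(grad, lam, step=0.1, tol=1e-8, max_iter=10000):
    """Repeat gradient updates until the parameter change is at or below tol."""
    for _ in range(max_iter):
        g = grad(lam)
        new_lam = [value - step * g_i for value, g_i in zip(lam, g)]
        change = max(abs(a - b) for a, b in zip(new_lam, lam))
        lam = new_lam
        if change <= tol:       # corresponds to 'Yes' at the convergence check
            break
    return lam

# Toy convex objective (lam0 - 1)^2 + (lam1 + 2)^2, started from the all-zero vector.
toy_grad = lambda lam: [2.0 * (lam[0] - 1.0), 2.0 * (lam[1] + 2.0)]
lam_star = minimize(toy_grad, [0.0, 0.0])
print(lam_star)
```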

In this specific example, the objective function uses the difference between the appearance probability of the correct segment and that of the segment most likely to be output other than the correct segment. All probabilities are therefore used in learning, and the objective function becomes a continuous convex function. In the prior art, the objective function could be a discontinuous (convex) function, so ordinary numerical optimization methods could not simply be applied; a numerical optimization method that handles discontinuous points had to be used. With this specific example, a general unconstrained numerical optimization method can be used, which improves computational efficiency.

(Specific example 2: sequence segmentation with the F value as the evaluation index)
Next, a specific example is given for sequence segmentation in which the "evaluation unit" is the "segment" and the "evaluation index" is the "F value". The case of maximizing the F value shown in formula (11) is described. This is equivalent to minimizing the following expression.

Here, M is the number of positive-example segments, and FP_l and FN_l are the estimates of FP (false positives) and FN (false negatives) during learning. This expression is the denominator obtained when formula (11) is transformed as follows, and it follows from the fact that minimizing the denominator maximizes the whole.

From formula (29), the F value can be maximized if the estimates FP_l and FN_l of FP and FN can be calculated. Therefore, to maximize the F value of formula (11) in sequence segmentation rather than the accuracy, learning is performed with formula (28) as the objective function. The procedure for deriving FP_l and FN_l is described below.
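The relationship between the F value and a ratio to be minimized can be illustrated numerically. Formulas (11), (28) and (29) are not reproduced in this text, so the sketch assumes the common weighted F definition with trade-off constant γ and uses TP = M − FN; the exact form in the patent may differ, and `f_value` and `f_objective` are hypothetical names.

```python
def f_value(tp, fp, fn, gamma=1.0):
    """Weighted F value in the common form (assumed to match formula (11))."""
    g2 = gamma ** 2
    return (g2 + 1.0) * tp / (g2 * fn + fp + (g2 + 1.0) * tp)

def f_objective(m, fp_l, fn_l, gamma=1.0):
    """Ratio Z_N / Z_D; F = 1 / (1 + Z_N / Z_D), so a smaller ratio means a larger F."""
    g2 = gamma ** 2
    z_n = g2 * fn_l + fp_l              # assumed numerator
    z_d = (g2 + 1.0) * (m - fn_l)       # assumed denominator, using TP = M - FN
    return z_n / z_d

m, fp, fn = 10, 2, 3
ratio = f_objective(m, fp, fn)
print(f_value(m - fn, fp, fn), 1.0 / (1.0 + ratio))
```

Under these assumptions the two printed values coincide, which is the sense in which minimizing the ratio maximizes the F value.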

First, FN_l is described. FN (false negative) counts positive examples that are missed, so its estimate FN_l can be defined as follows using the error estimation function.

Because FN_l estimates errors on positive examples, it needs to be computed only for positive examples; the second summation is therefore taken only over correct segments that are positive examples.

Next, FP_l is described. FP_l estimates errors made toward positive examples, so errors toward negative examples are not considered. The error estimation function must therefore be modified so that errors toward negative examples are excluded from the calculation.

However, when the correct segment is O, there is no error toward a negative example, so the calculation is equivalent to that of formula (19). FP_l, the estimate of errors made toward positive examples, can therefore be defined as follows using the error estimation function of formula (31).

Unlike FN_l, FP_l is computed over all possible segments.
The expression used by the parameter learning unit to obtain the gradient is as follows.

Each component is shown below.
First, the partial derivatives of the objective function with respect to FN_l and FP_l are as follows.

Here, Z_N and Z_D denote the values of the numerator and denominator of formula (28), respectively.
Next, the partial derivatives of FN_l and FP_l with respect to the smoothing function l are constants, because FN_l and FP_l are simple linear sums of l.
The partial derivative of the smoothing function l (the sigmoid function is applied here) is, by definition, as follows.
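For reference, the derivative of the plain sigmoid can be checked numerically. The sketch assumes the unscaled sigmoid 1 / (1 + exp(−d)); any slope or offset parameter used in the patent's own definition is not visible in this text, and the helper names are hypothetical.

```python
import math

def sigmoid(d):
    """Plain sigmoid 1 / (1 + exp(-d))."""
    return 1.0 / (1.0 + math.exp(-d))

def sigmoid_grad(d):
    """Analytic derivative: l(d) * (1 - l(d))."""
    s = sigmoid(d)
    return s * (1.0 - s)

def numeric_grad(f, d, eps=1e-6):
    """Central finite difference, used only to check the analytic form."""
    return (f(d + eps) - f(d - eps)) / (2.0 * eps)

print(sigmoid_grad(0.7), numeric_grad(sigmoid, 0.7))
```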

Finally, the partial derivatives of the error estimation functions d() and d'() with respect to the parameters are shown.

All processing other than the definitions of the evaluation unit and the objective function and the calculation of the objective function and its gradient is exactly the same as when learning with the segment accuracy, and is therefore omitted here.

Although embodiments of the present invention have been described above, the formulas and other details used in the description are merely examples and can be changed as appropriate for the language model to be applied. The scope of the present invention is therefore determined by the technical idea described in the claims.

FIG. 1 is a drawing showing examples of sequence segmentation and sequence labeling.
FIG. 2 is a drawing explaining the IOB labeling method.
FIG. 3 is a drawing showing an example of part-of-speech tagging.
FIG. 4 is a drawing showing an example of sequence labeling to which a cascade model is applied.
FIG. 5 is a drawing showing an example of sequence labeling to which a conditional random field is applied.
FIG. 6 is a drawing showing an example of feature extraction.
FIG. 7 is a drawing explaining cliques in a conditional random field.
FIG. 8 is a drawing showing examples of the features for each clique in a conditional random field.
FIG. 9 is a drawing showing examples of smoothing functions.
FIG. 10 is a drawing explaining the error estimation calculation method of the prior art.
FIG. 11 is a drawing explaining the problems of the prior art.
FIG. 12 is a drawing explaining an example of segment output probabilities.
FIG. 13 is a drawing explaining the meaning of formula (19) in the output candidate graph.
FIG. 14 is a block diagram of the text analysis device.
FIG. 15 is an explanatory drawing outlining the processing operation of the text analysis device.
FIG. 16 is a drawing showing input and output examples of the learner.
FIG. 17 is a drawing showing input and output examples of the estimator.
FIG. 18 is a drawing showing the detailed blocks of the learner.
FIG. 19 is a flowchart showing the processing procedure of the learner.
FIG. 20 is a flowchart showing the detailed processing procedure of the parameter learning unit.
FIG. 21 shows (a) an example of the input learning data, (b) the extracted output label candidate set, and (c) an example of the generated output candidate graph.
FIG. 22 is a drawing showing an example in which a feature vector is assigned to each node of the output candidate graph.
FIG. 23 is a drawing explaining the case where the interval [4, 5] is evaluated in the example of the learning data.

Explanation of symbols

11 CPU
12 RAM
13 Input/output unit
14 Storage
141 Learner
144 Learning data
145 Parameter vector
1411 Output candidate graph generation unit
1412 Feature extraction unit
1413 Parameter learning unit

Claims

A language analysis model learning device that calculates, from learning data that is text data to which predetermined classification tags have been assigned in advance, a parameter vector used for assigning the classification tags to text data, the device comprising:
a storage unit that stores an output candidate graph in which output label candidates indicating the types of the classification tags are associated with the learning data, and an initial value of the parameter vector; and
a parameter learning unit that, using a decision function that expresses the appearance probability of a predetermined evaluation unit of the learning data as a marginal probability with the parameter vector as a variable, calculates as an error estimation function the difference between the appearance probability of the correct element of the learning data and the appearance probability of the element most likely to be output other than the correct element, and calculates a parameter vector that optimizes an objective function set using the error estimation function.

The language analysis model learning device according to claim 1, wherein the objective function is the sum of the outputs of the error estimation function over all minimum evaluation units, and the parameter vector that optimizes the objective function is the parameter vector that minimizes the output of the objective function.

The language analysis model learning device according to claim 1, wherein the objective function is the denominator of the F value shown in the following equation (1), set using the error estimation function, and the parameter vector that optimizes the objective function is the parameter vector that minimizes the output of the objective function.

Here, γ is a constant indicating the degree of trade-off between recall and precision, M is the number of positive-example segments, FP_l is the estimated number of errors made toward positive examples (false positives), and FN_l is the estimated number of errors made on positive examples (false negatives).

The language analysis model learning device according to any one of claims 1 to 3, wherein the parameter learning unit calculates the output of the decision function using a Forward-Backward algorithm.

The language analysis model learning device according to claim 4, wherein the parameter learning unit calculates the gradient of the objective function and, until the gradient becomes equal to or less than a predetermined value, recalculates the parameter vector that optimizes the objective function using the parameter vector at that time.

A language analysis model learning method in a language analysis model learning device that calculates, from learning data that is text data to which predetermined classification tags have been assigned in advance, a parameter vector used for assigning the classification tags to text data, wherein
a storage unit of the language analysis model learning device stores an output candidate graph in which output label candidates indicating the types of the classification tags are associated with the learning data, and an initial value of the parameter vector, and
a parameter learning unit of the language analysis model learning device:
calculates the appearance probability of the correct element of the learning data using a decision function that expresses the appearance probability of a predetermined evaluation unit of the learning data as a marginal probability with the parameter vector as a variable;
calculates, using the decision function, the appearance probability of the element most likely to be output other than the correct element;
calculates, as an error estimation function, the difference between the appearance probability of the correct element and the appearance probability of the element most likely to be output other than the correct element; and
calculates a parameter vector that optimizes an objective function set using the error estimation function.

The language analysis model learning method according to claim 6, wherein the objective function is the sum of the outputs of the error estimation function over all minimum evaluation units, and the parameter vector that optimizes the objective function is the parameter vector that minimizes the output of the objective function.

The language analysis model learning method according to claim 6, wherein the objective function is the denominator of the F value shown in the following equation (1), set using the error estimation function, and the parameter vector that optimizes the objective function is the parameter vector that minimizes the output of the objective function.

The language analysis model learning method according to any one of claims 6 to 8, wherein the parameter learning unit calculates the output of the decision function using a Forward-Backward algorithm.

The language analysis model learning method according to claim 6, wherein the parameter learning unit, after calculating the parameter vector, calculates the gradient of the objective function and, until the gradient becomes equal to or less than a predetermined value, recalculates the parameter vector that optimizes the objective function using the parameter vector at that time.

A language analysis model learning program that causes a computer to function as the language analysis model learning device according to any one of claims 1 to 5.

A recording medium in which the language analysis model learning program according to claim 9 is recorded.