JP5566704B2

JP5566704B2 - Word boundary judgment device

Info

Publication number: JP5566704B2
Application number: JP2010006049A
Authority: JP
Inventors: 正柳原; 一則松本; 康弘滝嶋; 和史池田
Original assignee: KDDI R&D Laboratories Inc
Current assignee: KDDI Research Inc
Priority date: 2010-01-14
Filing date: 2010-01-14
Publication date: 2014-08-06
Anticipated expiration: 2030-01-14
Also published as: JP2011145885A

Description

The present invention relates to a word boundary determination device.

In morphological analysis, character strings that cannot be identified as words (hereinafter, "unknown character strings") are often output. In general, a character string that is not registered in the dictionary (hereinafter, "morphological analysis dictionary") referenced by the main part of a morphological analyzer (hereinafter, "morphological analysis engine") is output as an unknown character string.

As a technique for correctly identifying words in a character string, one conceivable method uses n-gram statistics to estimate where the word boundaries lie within an unknown character string and then estimates the part of speech of each span estimated to be a word (see Non-Patent Document 1). For example, the method in the paper of Non-Patent Document 1 uses n-gram statistics to generate words from a character string based on the degree of association between characters, which is derived from probabilities calculated from character occurrence frequencies. It then uses a threshold to estimate the part of speech of each word. In addition, because the appropriate threshold often differs from one data set to another, the threshold is readjusted each time the input data is changed.

Shinsuke Mori, Makoto Nagao, "Unknown word extraction from corpus by n-gram statistics" (「nグラム統計によるコーパスからの未知語抽出」), Transactions of Information Processing Society of Japan, Vol.95, No.168, pp.7-12, 1998.
Kazunori Matsumoto, Kazuo Hashimoto, "Schema Design for Causal Law Mining from Incomplete Database", Discovery Science, Second International Conference, DS '99, Tokyo, Japan, December, 1999, Proceedings. Lecture Notes in Computer Science 1721, Springer, pp.92-102, 1999.

However, the method in the paper of Non-Patent Document 1 has the following problems. Statistical information is expressed as probabilities, but converting counts to probabilities discards the reliability carried by the original amount of information. For example, a word that appears 10 times in 100 sentences is more reliable, in terms of the amount of information, than a word that appears once in 10 sentences; when probabilities are used, however, both are treated simply as a probability of 0.1, and that reliability is lost. Furthermore, Non-Patent Document 1 examines only the relationship between an arbitrary character string and the character that follows it, which is less accurate than also examining the relationship with the character that precedes the string. In addition, using a threshold means that boundaries are discriminated linearly; for the sake of accuracy, it is desirable to use a word boundary estimation method capable of nonlinear discrimination.

The present invention has been made in view of the above problems, and its object is to provide a technique for determining word boundaries in a character string with high reliability.

To solve the above problem, a word boundary determination device according to one aspect of the present invention comprises: a word boundary existence probability storage unit that stores, for each joint score indicating the degree of joining between character strings, or for each joint score group classified according to ranges of the joint score, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between character strings; a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and a determination unit that determines word boundaries in a character string subject to word boundary determination by using the reliability scores of the joint scores between the character strings contained in that character string, wherein the reliability score assigning unit assigns, as the reliability score of the joint score between a given pair of character strings, the word boundary existence probability that is stored in the word boundary existence probability storage unit and corresponds to the joint score between those character strings.

The word boundary determination device may further comprise a joint score calculation unit that calculates the joint score between a first character string and a second character string. The joint score calculation unit may count a first appearance count, which is the number of times the second character string appears following the first character string in the text; a second appearance count, which is the number of times a character string other than the second character string appears following the first character string in the text; a third appearance count, which is the number of times the second character string appears following a character string other than the first character string in the text; and a fourth appearance count, which is the number of times a character string other than the second character string appears following a character string other than the first character string in the text. It may then calculate the joint score between the first character string and the second character string based on the first, second, third, and fourth appearance counts.

Specifically, letting a be the first appearance count, b the second appearance count, c the third appearance count, d the fourth appearance count, h = a + b, k = a + c, and n = a + b + c + d, the joint score calculation unit may calculate a first score and a second score according to the following arithmetic formulas, and may calculate the difference between the first score and the second score as the joint score between the first character string and the second character string.
(Arithmetic formula for the first score)
First score = -2 × {h log h + k log k + (n - h) log(n - h) + (n - k) log(n - k) - 2n log n} + 2 × 3
(Arithmetic formula for the second score)
Second score = -2 × {a log a + b log b + c log c + d log d - n log n} + 2 × 2
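The two arithmetic formulas above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the natural logarithm is assumed because the text does not state the base, and the convention 0 log 0 = 0 is assumed so that zero counts do not raise an error.

```python
import math


def xlogx(x: float) -> float:
    """Return x * log(x); the convention 0 * log(0) = 0 is an assumption."""
    return x * math.log(x) if x > 0 else 0.0


def first_score(a: int, b: int, c: int, d: int) -> float:
    """First score, computed from the marginal totals h, k and n."""
    h, k, n = a + b, a + c, a + b + c + d
    inner = xlogx(h) + xlogx(k) + xlogx(n - h) + xlogx(n - k) - 2 * xlogx(n)
    return -2 * inner + 2 * 3


def second_score(a: int, b: int, c: int, d: int) -> float:
    """Second score, computed from the four counts themselves."""
    n = a + b + c + d
    inner = xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) - xlogx(n)
    return -2 * inner + 2 * 2


def score_difference(a: int, b: int, c: int, d: int) -> float:
    """Difference between the first and second scores, used as the joint score."""
    return first_score(a, b, c, d) - second_score(a, b, c, d)
```

Under these assumptions, a table with no association (a = b = c = d = 1) gives a difference of exactly 2, while a strongly associated table such as (10, 0, 0, 10) gives a much larger value.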

For a target character string that has word boundary information, the determination unit may operate on the reliability scores of the joint scores between the character strings contained in that target character string. Among the reliability scores of the joint score (E_n) between the n-th pair of character strings, let the reliability score P(E_n)_A be the probability that a word boundary exists and the reliability score P(E_n)_B be the probability that no word boundary exists. The determination unit may calculate an entropy value (I_n) for each joint score (E_n) using P(E_n)_A and P(E_n)_B, and may determine whether a word boundary exists in the character string based on the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1).

For a target character string that has no word boundary information, the determination unit may operate on the joint scores between the characters contained in that target character string and on the reliability scores of those joint scores. Among the reliability scores of the joint score (E_n) between the n-th pair of character strings, let the reliability score P(E_n)_A be the probability that a word boundary exists and the reliability score P(E_n)_B be the probability that no word boundary exists. The determination unit may calculate an entropy value (I_n) for each joint score (E_n) using P(E_n)_A and P(E_n)_B, and may determine whether a word boundary exists in the character string based on both the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1) and the magnitude relationship between the joint score (E_n) and a predetermined threshold (C_2).
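As a rough illustration of the entropy value I_n, the sketch below computes a binary Shannon entropy from P(E_n)_A and P(E_n)_B. This passage does not give the entropy formula or the direction of the comparison with C_1, so both are assumptions here: the entropy is taken in bits, and a value below C_1 is treated as a confident position. The function names are hypothetical.

```python
import math


def entropy(p_a: float, p_b: float) -> float:
    """Binary Shannon entropy in bits of the pair (P_A, P_B).

    Assumed form; terms with zero probability contribute nothing.
    """
    return -sum(p * math.log2(p) for p in (p_a, p_b) if p > 0)


def is_confident(p_a: float, p_b: float, c1: float) -> bool:
    """Assumed rule: the position is confident when I_n is below C_1."""
    return entropy(p_a, p_b) < c1
```

With this form, equal probabilities (0.5, 0.5) give the maximum entropy of 1 bit, and a one-sided pair such as (1.0, 0.0) gives 0.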

The device may further comprise an update unit that updates the learning data based on the word boundary determination result.

The device may comprise an update unit that updates the learning data based on the word boundary determination result, and the update unit may compare the average value I_AVE of the entropy values (I_n) in one target character string with a predetermined threshold (C_3) and, when I_AVE < C_3, add that target character string to the learning data. Alternatively, the update unit may compare the average value E_AVE of the joint scores (E_n) in one target character string with a predetermined threshold (C_4), also compare the average value I_AVE of the entropy values (I_n) in that target character string with the predetermined threshold (C_3), and, when E_AVE ≥ C_4 and I_AVE < C_3, add that target character string to the learning data.
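The two update conditions described above can be sketched as a single predicate. The function and parameter names are hypothetical, and the arithmetic mean is assumed for the "average value".

```python
from statistics import mean
from typing import Optional, Sequence


def should_add_to_learning_data(
    entropies: Sequence[float],
    c3: float,
    joint_scores: Optional[Sequence[float]] = None,
    c4: Optional[float] = None,
) -> bool:
    """Decide whether a target character string is added to the learning data.

    With entropies only:    I_AVE < C_3.
    With joint scores too:  E_AVE >= C_4 and I_AVE < C_3.
    """
    i_ave = mean(entropies)
    if joint_scores is None or c4 is None:
        return i_ave < c3
    e_ave = mean(joint_scores)
    return e_ave >= c4 and i_ave < c3
```

For example, `should_add_to_learning_data([0.1, 0.3], c3=0.5)` accepts the string, while also passing `joint_scores=[0.2, 0.4]` with `c4=0.5` rejects it because E_AVE falls below C_4.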

According to the present invention, word boundaries in a character string can be determined with high reliability.

An example of a functional block diagram of the word boundary determination device according to the first embodiment of the present invention.
A diagram illustrating the process by which the joint score calculation unit generates joint scores.
An example of the information stored in the joint score storage unit.
An example of the information stored in the word boundary existence probability storage unit.
A flowchart showing an example of the operation of the word boundary determination device.

A first embodiment of the present invention is described in detail below with reference to the drawings. As shown in FIG. 1, the word boundary determination device 1 according to the first embodiment includes a joint score calculation unit 10, a word boundary estimation unit 20, a word boundary existence probability calculation unit 30, an extraction unit 40, a reliability score assigning unit 50, a learning data update unit 60, a labeled data storage unit 90, a joint score storage unit 92, a labeled data storage unit 94, and a word boundary existence probability storage unit 96.

The labeled data storage unit 90 stores text data that includes word boundaries. The text data stored in the labeled data storage unit 90 is word data without part-of-speech tags, input by a user as learning data. The text data stored in the labeled data storage unit 90 preferably contains many unknown character strings.

The joint score calculation unit 10 calculates joint scores using the text data (learning data) stored in the labeled data storage unit 90. A joint score is an index indicating the degree of joining between character strings. It is calculated for a character string (consisting of one or more characters) contained in the text given as learning data, by tallying the distribution of the characters that appear before and after that character string in the text. The more often no word boundary is formed between a character string and an adjacent character string in the text, the larger the joint score. In other words, the larger the joint score, the less likely it is that a word boundary exists between the two character strings.

The joint score calculation function of the joint score calculation unit 10 is described in detail below. For each pair consisting of one character string in the text and an appearing character string that appears before or after it in that text, the joint score calculation unit 10 counts the number of appearances of the appearing character string in the text and, based on the counts for each pair, calculates the joint score between the one character string and the appearing character string. Specifically, the joint score calculation unit 10 uses an evaluation technique based on model testing to measure the degree of association (degree of joining) between characters or character strings.

Specifically, when calculating the joint score between a first character string and a second character string, the joint score calculation unit 10 counts a first appearance count, which is the number of times the second character string appears following the first character string in the text; a second appearance count, which is the number of times a character string other than the second character string appears following the first character string; a third appearance count, which is the number of times the second character string appears following a character string other than the first character string; and a fourth appearance count, which is the number of times a character string other than the second character string appears following a character string other than the first character string. It then calculates the joint score between the first character string and the second character string based on the first, second, third, and fourth appearance counts.

More specifically, for each pair of k-string and v-string, the joint score calculation unit 10 tallies the appearance counts a11, a12, a21, and a22, as shown in FIG. 2(a). The "k-string" is an N-gram and corresponds to the "first character string" above; the "v-string" is the character string for which it is to be determined whether it should be joined to the k-string, and corresponds to the "second character string" above. That is, a pair of k-string and v-string corresponds to a pair consisting of the first character string and the second character string. The same applies to FIG. 2(b).

"a11" is the number of times the v-string appeared adjacent to the k-string in the text. That is, a11 corresponds to the first appearance count above, the number of times the second character string appeared following the first character string in the text.
For example, with k-string "旧" (old) and v-string "姓" (surname), if the character string "旧姓" (maiden name) appears once in the text data (learning data) stored in the labeled data storage unit 90, then a11 is "1", as shown in FIG. 2(a).
In a11, the first character string and the second character string correspond to the one character string and the appearing character string.

"a12" is the number of times the v-string did not appear adjacent to the k-string in the text; in other words, the number of times an arbitrary character other than the v-string appeared adjacent to the k-string. That is, a12 corresponds to the second appearance count above, the number of times a character string other than the second character string appeared following the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "旧暦" (old calendar) and "旧モ" appear a total of 300 times in the text data (learning data) stored in the labeled data storage unit 90, then a12 is "300", as shown in FIG. 2(a). The character string "旧モ" is, for example, part of the character string "旧モデル" (old model).
In a12, the first character string and the character string other than the second character string correspond to the one character string and the appearing character string.

"a21" is the number of times the v-string was not adjacent to the k-string in the text; in other words, the number of times the v-string appeared adjacent to an arbitrary character string other than the k-string. That is, a21 corresponds to the third appearance count above, the number of times the second character string appeared following a character string other than the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "の姓" and "（姓" appear a total of once in the text data (learning data) stored in the labeled data storage unit 90, then a21 is "1", as shown in FIG. 2(a). The character string "（姓" is, for example, part of the character string "氏（姓）".
In a21, the character string other than the first character string and the second character string correspond to the one character string and the appearing character string.

"a22" is the number of occurrences in the text that involve neither the k-string nor the v-string; in other words, the number of times an arbitrary character string other than the k-string appeared adjacent to an arbitrary character other than the v-string. That is, a22 corresponds to the fourth appearance count above, the number of times a character string other than the second character string appeared following a character string other than the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "私は" and "明日" appear a total of 300 times in the text data (learning data) stored in the labeled data storage unit 90, then a22 is "300", as shown in FIG. 2(a).
In a22, the character string other than the first character string and the character string other than the second character string correspond to the one character string and the appearing character string.
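The tallying of a11, a12, a21, and a22 described above can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and every position in each sentence is treated as one observation of a left window (as long as the k-string) followed by a right window (as long as the v-string). The text does not spell out how positions are enumerated.

```python
from typing import Iterable, Tuple


def contingency_counts(
    sentences: Iterable[str], k_string: str, v_string: str
) -> Tuple[int, int, int, int]:
    """Return (a11, a12, a21, a22) for the pair (k_string, v_string)."""
    a11 = a12 = a21 = a22 = 0
    lk, lv = len(k_string), len(v_string)
    for sentence in sentences:
        # Slide over every adjacent (left, right) window pair in the sentence.
        for i in range(len(sentence) - lk - lv + 1):
            left = sentence[i:i + lk]
            right = sentence[i + lk:i + lk + lv]
            if left == k_string and right == v_string:
                a11 += 1  # k-string followed by v-string
            elif left == k_string:
                a12 += 1  # k-string followed by something else
            elif right == v_string:
                a21 += 1  # something else followed by v-string
            else:
                a22 += 1  # neither side matches
    return a11, a12, a21, a22
```

For the two toy sentences "旧姓は中野。" and "旧暦の姓" with k-string "旧" and v-string "姓", this enumeration yields one observation each for a11, a12, and a21, and five for a22.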

Having tallied the appearance counts a11, a12, a21, and a22 for a pair, the joint score calculation unit 10 calculates, based on those counts, the joint score (denoted "score" in FIG. 2) between the k-string and the v-string that make up the pair. For example, as shown in FIG. 2(b), the joint score calculation unit 10 calculates a score of "0.33" between k-string "旧" and v-string "姓" based on the appearance counts a11, a12, a21, and a22 of that pair.

In FIG. 2(b), "aic(IM)" is a score calculated on the assumption that a11, a12, a21, and a22 are independent events. Specifically, letting h = a11 + a12, k = a11 + a21, and n = a11 + a12 + a21 + a22, it is calculated by the following formula (1).

In FIG. 2(b), "aic(DM)" is a score calculated on the assumption that a11, a12, a21, and a22 are dependent events. Specifically, letting a = a11, b = a12, c = a21, d = a22, and n = a11 + a12 + a21 + a22, it is calculated by the following formula (2).

The joint score is calculated from aic(IM) and aic(DM). Specifically, it is calculated by the following formula (3) when a11/(a11 + a12) > a21/(a21 + a22), and by the following formula (4) when a11/(a11 + a12) < a21/(a21 + a22).

Having calculated a joint score, the joint score calculation unit 10 outputs it to the joint score storage unit 92. For example, as shown in FIG. 2(c), the joint score calculation unit 10 outputs the joint score to the joint score storage unit 92 in association with the pair (k-string, v-string).

The joint score storage unit 92 stores the joint scores output from the joint score calculation unit 10. For example, as shown in FIG. 3, the joint score storage unit 92 stores each joint score in association with its pair (k-string, v-string). In the example shown in FIG. 3, the joint score storage unit 92 stores the score "0.33" in association with the pair of k-string "旧" and v-string "姓". FIG. 3 shows the joint scores for "旧姓は中野。" ("The maiden name is Nakano."), but the joint score values for pairs other than (旧, 姓) are omitted.

The word boundary estimation unit 20 estimates the word boundaries at which an unknown character string is to be divided into words, using the joint scores generated by the joint score calculation unit 10 (that is, the joint scores stored in the joint score storage unit 92) and the unknown character strings stored in an unknown character string storage device (not shown). Having estimated the word boundaries of an unknown character string, the word boundary estimation unit 20 extracts each word obtained by dividing the unknown character string at those boundaries, and stores each word in the labeled data storage unit 94 as word data without part-of-speech tags.
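The passage above does not state the rule by which boundaries are derived from the joint scores. Purely as an illustration, the sketch below assumes the simplest possible rule: a boundary is placed between two adjacent characters when their stored joint score is below a hypothetical threshold, consistent with the statement that a larger joint score makes a boundary less likely. The function name, the threshold, and the default score for unseen pairs are all assumptions.

```python
from typing import Dict, List, Tuple


def split_by_joint_score(
    text: str,
    scores: Dict[Tuple[str, str], float],
    threshold: float,
    default: float = 0.0,
) -> List[str]:
    """Split text wherever the joint score of adjacent characters is low."""
    words, start = [], 0
    for i in range(1, len(text)):
        pair = (text[i - 1], text[i])
        # Assumed rule: a low joint score means the characters do not join.
        if scores.get(pair, default) < threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words
```

For instance, with only the pair ("旧", "姓") stored at 0.33 and a threshold of 0.1, the string "旧姓は" is split into "旧姓" and "は".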

The labeled data storage unit 94 stores the word data without part-of-speech tags output from the word boundary estimation unit 20. That is, whereas the word data without part-of-speech tags stored in the labeled data storage unit 90 described above is data input by a user, the word data without part-of-speech tags stored in the labeled data storage unit 94 is data output mechanically (by the word boundary estimation unit 20). The word data without part-of-speech tags stored in the labeled data storage unit 94 is used for part-of-speech estimation by a part-of-speech estimation device (not shown).

The word boundary existence probability calculation unit 30 refers to the joint scores stored in the joint score storage unit 92 and calculates a word boundary existence probability for each joint score. The word boundary existence probability is a probability representing whether or not a word boundary exists (is formed) between character strings.

For example, the word boundary existence probability calculation unit 30 calculates, for each joint score E_X (X = 1, 2, ..., n), the number of cases Z_X, that is, the number of pairs Z_X, and also calculates, among those Z_X cases, the number of times A_X that a word boundary existed and the number of times B_X that no word boundary existed. Then, for each joint score E_X, the word boundary existence probability calculation unit 30 calculates, as the word boundary existence probability, the probability A_X/Z_X that a word boundary exists and the probability B_X/Z_X that no word boundary exists. That is, it calculates A_1/Z_1 and B_1/Z_1 as the word boundary existence probability of joint score E_1, A_2/Z_2 and B_2/Z_2 as that of joint score E_2, and so on, up to A_n/Z_n and B_n/Z_n as that of joint score E_n. The word boundary existence probability calculation unit 30 may calculate only one of A_X/Z_X and B_X/Z_X.
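A minimal sketch of this per-score tally is shown below. The input format, a sequence of (joint score, boundary present) observations, and the function name are assumptions made for illustration; Z_X is taken to be A_X + B_X.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def boundary_probabilities(
    observations: Iterable[Tuple[float, bool]],
) -> Dict[float, Tuple[float, float]]:
    """Map each joint score E_X to (A_X / Z_X, B_X / Z_X)."""
    counts = defaultdict(lambda: [0, 0])  # score -> [A_X, B_X]
    for score, has_boundary in observations:
        counts[score][0 if has_boundary else 1] += 1
    # Z_X = A_X + B_X is the number of cases observed for that score.
    return {s: (a / (a + b), b / (a + b)) for s, (a, b) in counts.items()}
```

If the score 0.33 is observed three times, twice with a boundary and once without, the stored pair for 0.33 is (2/3, 1/3).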

Alternatively, the word boundary existence probability calculation unit 30 may calculate the word boundary existence probability (A_GX/Z_GX, B_GX/Z_GX) not for each joint score but for each joint score group E_GX (X = 1, 2, ..., m) classified according to the range of joint score values. In other words, the word boundary existence probability calculation unit 30 may group joint scores that are close to one another (E_G1, E_G2, ...) and calculate the word boundary existence probability for each group. That is, it calculates A_G1/Z_G1 and B_G1/Z_G1 as the word boundary existence probability of joint score group E_G1, A_G2/Z_G2 and B_G2/Z_G2 as that of joint score group E_G2, and so on, up to A_Gm/Z_Gm and B_Gm/Z_Gm as that of joint score group E_Gm. The word boundary existence probability calculation unit 30 may calculate only one of A_GX/Z_GX and B_GX/Z_GX. Because the number of cases Z_GX is larger than the number of cases Z_X, calculating the word boundary existence probability for each joint score group resolves the problem that a reasonable probability cannot be calculated when the number of cases Z is extremely small.
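The grouped variant can be sketched in the same way. The text does not say how the ranges are chosen, so fixed-width bins are assumed here; the function name and the `width` parameter are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def grouped_boundary_probabilities(
    observations: Iterable[Tuple[float, bool]], width: float
) -> Dict[int, Tuple[float, float]]:
    """Map each joint score group index to (A_GX / Z_GX, B_GX / Z_GX)."""
    counts = defaultdict(lambda: [0, 0])  # group index -> [A_GX, B_GX]
    for score, has_boundary in observations:
        group = math.floor(score / width)  # assumed fixed-width binning
        counts[group][0 if has_boundary else 1] += 1
    return {g: (a / (a + b), b / (a + b)) for g, (a, b) in counts.items()}
```

With a width of 0.5, the nearby scores 0.31, 0.33, and 0.38 fall into the same group, so their observations are pooled into a single probability estimate instead of three separate single-case estimates.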

The word boundary existence probability calculation unit 30 stores the word boundary existence probability calculated for each joint score or for each joint score group in the word boundary existence probability storage unit 96.

The word boundary existence probability storage unit 96 stores the word boundary existence probability output from the word boundary existence probability calculation unit 30. For example, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score, as shown in FIG. 4(a). The word boundary existence probability storage unit 96 may also store the word boundary existence probability for each joint score group, as shown in FIG. 4(b).

In response to a request from outside or from the reliability score assigning unit 50, the extraction unit 40 extracts, from the labeled data storage unit 94, a character string that is the target of word boundary determination (hereinafter referred to as the "target character string"). Specifically, the extraction unit 40 extracts from the labeled data storage unit 94, as the target character string, a character string that has word boundary information (a character string that retains its label information); this extraction mode is hereinafter referred to as "extraction with word boundaries". The extraction unit 40 may instead extract from the labeled data storage unit 94, as the target character string, a character string that has no word boundary information (a character string whose label information has been discarded); this extraction mode is hereinafter referred to as "extraction without word boundaries". In the word boundary determination device 1, whether the extraction unit 40 performs extraction with word boundaries or extraction without word boundaries may be fixed in advance, or may be switched according to an external input.

The extraction unit 40 outputs the extracted target character string (a character string with word boundary information or a character string without word boundary information) to the reliability score assigning unit 50.

The reliability score assigning unit 50 acquires the target character string from the extraction unit 40. Having acquired the target character string, the reliability score assigning unit 50 assigns a reliability score to each joint score between the character strings included in the target character string. Specifically, as the reliability score of the joint score at a given inter-string position, the reliability score assigning unit 50 assigns the word boundary existence probability that is stored in the word boundary existence probability storage unit 96 for that joint score. The reliability score is an index indicating the reliability of each joint score.

The reliability score assigning function of the reliability score assigning unit 50 is described in detail below, separately for each unit in which the word boundary existence probability calculation unit 30 calculates the word boundary existence probability (per joint score or per joint score group) and for each extraction mode of the extraction unit 40.

(Word boundary existence probability calculated per joint score, with extraction with word boundaries)
The description here takes as an example the case where the reliability score assigning unit 50 has acquired from the extraction unit 40 the target character string "/すぐ/行く/！/" ("go right away!"), which has word boundary information (each "/" is word boundary information).

Having acquired the target character string "/すぐ/行く/！/", the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score at each inter-string position in the target character string. That is, the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score E_1 of the first inter-string position ((no character)/すぐ), the joint score E_2 of the second inter-string position (すぐ/行く), the joint score E_3 of the third inter-string position (行く/！), and the joint score E_4 of the fourth inter-string position (！/(no character)).

Having acquired the joint score of each inter-string position, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities corresponding to the joint scores E_1, E_2, E_3, and E_4 of the first to fourth inter-string positions, and assigns them to the joint scores E_1, E_2, E_3, and E_4 as reliability scores.

That is, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score. For example, when A_1/Z_1 and B_1/Z_1 are stored in the word boundary existence probability storage unit 96 as the word boundary existence probability of E_1, the reliability score assigning unit 50 assigns A_1/Z_1 and B_1/Z_1 to the joint score E_1 of the first inter-string position; when A_2/Z_2 and B_2/Z_2 are stored as the word boundary existence probability of E_2, it assigns A_2/Z_2 and B_2/Z_2 to the joint score E_2 of the second inter-string position; ...; and when A_4/Z_4 and B_4/Z_4 are stored as the word boundary existence probability of E_4, it assigns A_4/Z_4 and B_4/Z_4 to the joint score E_4 of the fourth inter-string position.

(Word boundary existence probability calculated per joint score group, with extraction with word boundaries)
As above, the description here takes as an example the case where the reliability score assigning unit 50 has acquired the target character string "/すぐ/行く/！/".

Having acquired the target character string "/すぐ/行く/！/", the reliability score assigning unit 50 acquires from the joint score storage unit 92, in the same way as in the case of calculation per joint score with extraction with word boundaries, the joint score E_1 of the first inter-string position ((no character)/すぐ), the joint score E_2 of the second inter-string position (すぐ/行く), the joint score E_3 of the third inter-string position (行く/！), and the joint score E_4 of the fourth inter-string position (！/(no character)).

Having acquired the joint score of each inter-string position, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities of the joint score groups that contain the joint scores E_1, E_2, E_3, and E_4 of the first to fourth inter-string positions, and assigns them to the joint scores E_1, E_2, E_3, and E_4 as reliability scores.

That is, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score group. For example, when A_G1/Z_G1 and B_G1/Z_G1 are stored in the word boundary existence probability storage unit 96 as the word boundary existence probability of the joint score group E_G1 that contains E_1, the reliability score assigning unit 50 assigns A_G1/Z_G1 and B_G1/Z_G1 to the joint score E_1 of the first inter-string position; when A_G2/Z_G2 and B_G2/Z_G2 are stored as the word boundary existence probability of the joint score group E_G2 that contains E_2, it assigns A_G2/Z_G2 and B_G2/Z_G2 to the joint score E_2 of the second inter-string position; ...; and when A_G4/Z_G4 and B_G4/Z_G4 are stored as the word boundary existence probability of the joint score group E_G4 that contains E_4, it assigns A_G4/Z_G4 and B_G4/Z_G4 to the joint score E_4 of the fourth inter-string position.
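The lookup for a segmented target character string can be sketched roughly as below. The dictionaries standing in for the joint score storage unit 92 and the word boundary existence probability storage unit 96, the range edge, and every numeric value are hypothetical; the empty string stands for "(no character)".

```python
from bisect import bisect_right

def assign_reliability(tokens, joint_scores, edges, group_probs):
    """Attach a reliability score (A_G/Z_G, B_G/Z_G) to each inter-string position."""
    padded = [""] + tokens + [""]  # "" marks the start and end of the string
    rows = []
    for left, right in zip(padded, padded[1:]):
        score = joint_scores[(left, right)]   # stand-in for storage unit 92
        group = bisect_right(edges, score)    # group containing this score
        rows.append(((left, right), score, group_probs[group]))  # stand-in for unit 96
    return rows

tokens = ["すぐ", "行く", "！"]
joint_scores = {("", "すぐ"): 0.1, ("すぐ", "行く"): 0.3,
                ("行く", "！"): 0.6, ("！", ""): 0.9}   # made-up E_1..E_4
group_probs = {0: (0.8, 0.2), 1: (0.3, 0.7)}            # made-up group table
rows = assign_reliability(tokens, joint_scores, edges=[0.5], group_probs=group_probs)
```

For the per-joint-score case, the same loop applies with the probability table keyed directly by the score instead of by a group index.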

(Word boundary existence probability calculated per joint score, with extraction without word boundaries)
The description here takes as an example the case where the reliability score assigning unit 50 has acquired from the extraction unit 40 the target character string "すぐ行く！" ("go right away!"), which has no word boundary information.

Having acquired the target character string "すぐ行く！", the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score at each inter-character position in the target character string. That is, the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score E_1′ of the first inter-character position ((no character)/す), the joint score E_2′ of the second inter-character position (す/ぐ), the joint score E_3′ of the third inter-character position (ぐ/行), the joint score E_4′ of the fourth inter-character position (行/く), the joint score E_5′ of the fifth inter-character position (く/！), and the joint score E_6′ of the sixth inter-character position (！/(no character)). Note that the joint score E_6′ of the sixth inter-character position (！/(no character)) is identical to the joint score E_4 of the fourth inter-string position (！/(no character)) obtained when the target character string "/すぐ/行く/！/" is acquired.
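The enumeration of inter-character positions for an unsegmented string can be illustrated as follows. The helper name `character_gaps` is hypothetical, and the empty string is used here to stand for "(no character)".

```python
def character_gaps(text):
    """List the (left, right) character pair at every inter-character position,
    including the positions before the first and after the last character."""
    padded = [""] + list(text) + [""]
    return list(zip(padded, padded[1:]))

gaps = character_gaps("すぐ行く！")
for index, (left, right) in enumerate(gaps, start=1):
    print(f"E_{index}': {left or '(no character)'} / {right or '(no character)'}")
```

A string of five characters yields six positions, matching E_1′ through E_6′ in the example above.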

Having acquired the joint score of each inter-character position, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities corresponding to the joint scores E_1′, E_2′, E_3′, E_4′, E_5′, and E_6′ of the first to sixth inter-character positions, and assigns them to the respective joint scores as reliability scores.

That is, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score. For example, when A_1′/Z_1′ and B_1′/Z_1′ are stored in the word boundary existence probability storage unit 96 as the word boundary existence probability of E_1′, the reliability score assigning unit 50 assigns A_1′/Z_1′ and B_1′/Z_1′ to the joint score E_1′ of the first inter-character position; when A_2′/Z_2′ and B_2′/Z_2′ are stored as the word boundary existence probability of E_2′, it assigns A_2′/Z_2′ and B_2′/Z_2′ to the joint score E_2′ of the second inter-character position; ...; and when A_6′/Z_6′ and B_6′/Z_6′ are stored as the word boundary existence probability of E_6′, it assigns A_6′/Z_6′ and B_6′/Z_6′ to the joint score E_6′ of the sixth inter-character position.

(Word boundary existence probability calculated per joint score group, with extraction without word boundaries)
As above, the description here takes as an example the case where the reliability score assigning unit 50 has acquired the target character string "すぐ行く！".

Having acquired the target character string "すぐ行く！", the reliability score assigning unit 50 acquires from the joint score storage unit 92, in the same way as in the case of calculation per joint score with extraction without word boundaries, the joint score E_1′ of the first inter-character position ((no character)/す), the joint score E_2′ of the second inter-character position (す/ぐ), the joint score E_3′ of the third inter-character position (ぐ/行), the joint score E_4′ of the fourth inter-character position (行/く), the joint score E_5′ of the fifth inter-character position (く/！), and the joint score E_6′ of the sixth inter-character position (！/(no character)).

Having acquired the joint score of each inter-character position, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities of the joint score groups that contain the joint scores E_1′, E_2′, E_3′, E_4′, E_5′, and E_6′ of the first to sixth inter-character positions, and assigns them to the joint scores E_1′, E_2′, E_3′, E_4′, E_5′, and E_6′ as reliability scores.

That is, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score group. For example, when A_G1′/Z_G1′ and B_G1′/Z_G1′ are stored in the word boundary existence probability storage unit 96 as the word boundary existence probability of the joint score group E_G1′ that contains E_1′, the reliability score assigning unit 50 assigns A_G1′/Z_G1′ and B_G1′/Z_G1′ to the joint score E_1′ of the first inter-character position; when A_G2′/Z_G2′ and B_G2′/Z_G2′ are stored as the word boundary existence probability of the joint score group E_G2′ that contains E_2′, it assigns A_G2′/Z_G2′ and B_G2′/Z_G2′ to the joint score E_2′ of the second inter-character position; ...; and when A_G6′/Z_G6′ and B_G6′/Z_G6′ are stored as the word boundary existence probability of the joint score group E_G6′ that contains E_6′, it assigns A_G6′/Z_G6′ and B_G6′/Z_G6′ to the joint score E_6′ of the sixth inter-character position.

Having assigned a reliability score to each joint score, the reliability score assigning unit 50 outputs the reliability score of each joint score to the learning data update unit 60. Specifically, when the extraction by the extraction unit 40 is extraction with word boundaries, that is, when the reliability score assigning unit 50 has acquired from the extraction unit 40 a target character string that has word boundary information and has assigned reliability scores, it outputs to the learning data update unit 60 the reliability score of each joint score between the character strings included in the target character string. On the other hand, when the extraction by the extraction unit 40 is extraction without word boundaries, that is, when the reliability score assigning unit 50 has acquired from the extraction unit 40 a target character string that has no word boundary information and has assigned reliability scores, it outputs to the learning data update unit 60 both each joint score in the target character string and the reliability score of each joint score.

The learning data update unit 60 acquires from the reliability score assigning unit 50 the reliability score of each joint score in the target character string. Specifically, when the extraction by the extraction unit 40 is extraction with word boundaries, the learning data update unit 60 acquires from the reliability score assigning unit 50 the reliability score of each joint score between the character strings included in the target character string. On the other hand, when the extraction by the extraction unit 40 is extraction without word boundaries, the learning data update unit 60 acquires from the reliability score assigning unit 50 both each joint score in the target character string and the reliability score of each joint score.

Having acquired the reliability score of each joint score, the learning data update unit 60 determines the word boundaries in the target character string. Specifically, when the learning data update unit 60 has acquired from the reliability score assigning unit 50 the reliability score of each joint score in the target character string (that is, when it has not acquired the joint scores themselves), it determines the word boundaries in the target character string based on the reliability score of each joint score. In other words, when the extraction by the extraction unit 40 is extraction with word boundaries, the learning data update unit 60 determines the word boundaries in the target character string based only on the reliability score of each joint score.

On the other hand, when the learning data update unit 60 has acquired from the reliability score assigning unit 50 both each joint score in the target character string and the reliability score of each joint score (that is, when it has also acquired the joint scores themselves), it determines the word boundaries in the target character string based on each joint score and the reliability score of each joint score. In other words, when the extraction by the extraction unit 40 is extraction without word boundaries, the learning data update unit 60 determines the word boundaries in the target character string based on each joint score and its reliability score.

The word boundary determination function of the learning data update unit 60 is described in detail below, both for the case where word boundaries are determined based only on the reliability scores and for the case where they are determined based on the joint scores and the reliability scores.

(Determining word boundaries based only on the reliability scores)
The description here takes as an example the case where the learning data update unit 60 has acquired, for the target character string "/すぐ/行く/！/", the reliability score (A_G1/Z_G1 and B_G1/Z_G1) of the joint score E_1 of the first inter-string position ((no character)/すぐ), the reliability score (A_G2/Z_G2 and B_G2/Z_G2) of the joint score E_2 of the second inter-string position (すぐ/行く), the reliability score (A_G3/Z_G3 and B_G3/Z_G3) of the joint score E_3 of the third inter-string position (行く/！), and the reliability score (A_G4/Z_G4 and B_G4/Z_G4) of the joint score E_4 of the fourth inter-string position (！/(no character)).
For convenience of explanation, within the reliability score of the joint score E_1, the probability that a word boundary exists (A_G1/Z_G1 in the above example) is written as the reliability score P(E_1)_A, and the probability that no word boundary exists (B_G1/Z_G1 in the above example) is written as the reliability score P(E_1)_B. Likewise, within the reliability score of the joint score E_2, the probability that a word boundary exists (A_G2/Z_G2 in the above example) is written as P(E_2)_A, and the probability that no word boundary exists (B_G2/Z_G2 in the above example) is written as P(E_2)_B. The same notation is used for the joint scores E_3 and E_4.

Having acquired P(E_1)_A, P(E_1)_B, P(E_2)_A, P(E_2)_B, P(E_3)_A, P(E_3)_B, P(E_4)_A, and P(E_4)_B for E_1 to E_4 of the target character string "/すぐ/行く/！/", the learning data update unit 60 calculates an entropy value I_n for each E_n (n = 1, 2, 3, 4) using the entropy formula of Reference 1, specifically by equation (5). The value of I_n is an index of the degree of ambiguity.
(Reference 1)
M. Li and I. K. Sethi, "Confidence-Based Active Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251-1261, 2006
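Equation (5) is not reproduced in this text. As a rough illustration only, the sketch below assumes it is the standard Shannon entropy of the two probabilities P(E_n)_A and P(E_n)_B; the base-2 logarithm, the function name `entropy`, and the treatment of a missing P(E_n)_B as 1 - P(E_n)_A are all assumptions.

```python
import math

def entropy(p_boundary, p_no_boundary=None):
    """Shannon entropy (base 2, assumed) of a boundary / no-boundary pair.

    If only the boundary probability is given, its complement is used
    as the no-boundary probability (an assumption for illustration).
    """
    if p_no_boundary is None:
        p_no_boundary = 1.0 - p_boundary
    total = 0.0
    for p in (p_boundary, p_no_boundary):
        if p > 0.0:  # the limit of p * log(p) as p -> 0 is 0
            total -= p * math.log2(p)
    return total

ambiguous = entropy(0.5, 0.5)   # evenly split evidence: maximum ambiguity
confident = entropy(0.9, 0.1)   # lopsided evidence: lower ambiguity
```

Under this assumption the value is largest when the two probabilities are equal and falls toward zero as one of them dominates, which is consistent with I_n being read as a degree of ambiguity.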

The learning data update unit 60 compares I_n with a predetermined threshold C_1. Based on which of I_n and the threshold C_1 is larger, the learning data update unit 60 then determines whether a word boundary exists at the inter-string position of E_n.

Specifically, when I_n < C_1, the learning data update unit 60 determines that a word boundary exists at the inter-string position of E_n. For example, when I_2 < C_1, it determines that a word boundary exists at the second inter-string position (すぐ/行く). This is because, when the value of I_n is small, the reliability (certainty) of E_n with respect to the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94 is high. When I_n ≥ C_1, the learning data update unit 60 determines that no word boundary exists at the inter-string position of E_n. This is because, when the value of I_n is large, the reliability of E_n with respect to the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94 is not high.
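The threshold rule above can be sketched as follows. The entropy function is an assumed stand-in for equation (5) (standard base-2 Shannon entropy), and the threshold value and the reliability scores are made up for illustration.

```python
import math

def binary_entropy(p_a, p_b):
    """Assumed stand-in for equation (5): base-2 Shannon entropy of (p_a, p_b)."""
    return -sum(p * math.log2(p) for p in (p_a, p_b) if p > 0.0)

def judge_boundaries(reliability_scores, c1):
    """Return True where I_n < C_1 (boundary judged present), else False."""
    return [binary_entropy(p_a, p_b) < c1 for p_a, p_b in reliability_scores]

# Hypothetical (P(E_n)_A, P(E_n)_B) pairs for E_1..E_4 and a hypothetical C_1.
scores = [(0.95, 0.05), (0.9, 0.1), (0.5, 0.5), (1.0, 0.0)]
decisions = judge_boundaries(scores, c1=0.6)
```

With these made-up numbers only the third position, where the evidence is evenly split, is judged to have no boundary.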

As described above, the learning data update unit 60 determines the word boundaries in the target character string based only on the reliability score of each joint score between the character strings included in the target character string.

The above description covers an example in which the learning data update unit 60 acquires P(E_1)_A, P(E_1)_B, P(E_2)_A, P(E_2)_B, P(E_3)_A, P(E_3)_B, P(E_4)_A, and P(E_4)_B for E_1 to E_4. When the learning data update unit 60 instead acquires only P(E_1)_A, P(E_2)_A, P(E_3)_A, and P(E_4)_A for E_1 to E_4, without acquiring P(E_1)_B, P(E_2)_B, P(E_3)_B, and P(E_4)_B, it calculates the entropy value I_n by equation (6).

(Determining word boundaries based on the joint scores and the reliability scores)
The description here takes as an example the case where the learning data update unit 60 has acquired, for the target character string "すぐ行く！", the reliability score (A_G1′/Z_G1′ and B_G1′/Z_G1′) of the joint score E_1′ of the first inter-character position ((no character)/す), the reliability score (A_G2′/Z_G2′ and B_G2′/Z_G2′) of the joint score E_2′ of the second inter-character position (す/ぐ), the reliability score (A_G3′/Z_G3′ and B_G3′/Z_G3′) of the joint score E_3′ of the third inter-character position (ぐ/行), the reliability score (A_G4′/Z_G4′ and B_G4′/Z_G4′) of the joint score E_4′ of the fourth inter-character position (行/く), the reliability score (A_G5′/Z_G5′ and B_G5′/Z_G5′) of the joint score E_5′ of the fifth inter-character position (く/！), and the reliability score (A_G6′/Z_G6′ and B_G6′/Z_G6′) of the joint score E_6′ of the sixth inter-character position (！/(no character)).
For convenience of explanation, within the reliability score of the joint score E_1′, the probability that a word boundary exists (A_G1′/Z_G1′ in the above example) is written as the reliability score P(E_1′)_A, and the probability that no word boundary exists (B_G1′/Z_G1′ in the above example) is written as the reliability score P(E_1′)_B. Likewise, within the reliability score of the joint score E_2′, the probability that a word boundary exists (A_G2′/Z_G2′ in the above example) is written as P(E_2′)_A, and the probability that no word boundary exists (B_G2′/Z_G2′ in the above example) is written as P(E_2′)_B. The same notation is used for the joint scores E_3′ to E_6′.

Having acquired P(E_1′)_A, P(E_1′)_B, P(E_2′)_A, P(E_2′)_B, P(E_3′)_A, P(E_3′)_B, P(E_4′)_A, P(E_4′)_B, P(E_5′)_A, P(E_5′)_B, P(E_6′)_A, and P(E_6′)_B for E_1′ to E_6′ of the target character string "すぐ行く！", the learning data update unit 60 calculates an entropy value I_n for each E_n (n = 1, 2, 3, 4, 5, 6) by equation (5) above.

The learning data update unit 60 compares each joint score E_n with a predetermined threshold C_2, and compares I_n with the predetermined threshold C_1. Based on the result of comparing each joint score E_n with the threshold C_2 and the result of comparing I_n with the threshold C_1, the learning data update unit 60 then determines whether a word boundary exists at each position.

Specifically, when E_n < C_2 and I_n < C_1, the learning data update unit 60 determines that a word boundary exists between the character strings corresponding to E_n. For example, when E_2 < C_2 and I_2 < C_1, it determines that a word boundary exists between the second pair of character strings (す／ぐ). This is because, when E_n is small and I_n is small, the statement "a word boundary is established between the two character strings in many cases" is highly reliable (certain). When E_n ≥ C_2 and I_n < C_1, the learning data update unit 60 determines that no word boundary exists between the character strings corresponding to E_n. This is because, when E_n is large and I_n is small, the statement "a word boundary is not established between the two character strings in many cases" is highly reliable. When I_n ≥ C_1, the learning data update unit 60 determines that no word boundary exists between the character strings indicated by E_n. This is because, regardless of the value of E_n, a large I_n means that the value of E_n itself is not reliable.
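The three cases above reduce to a small decision function. The following Python sketch is only an illustration of that rule; the function name and the threshold values used in the example calls are hypothetical.

```python
def has_word_boundary(e_n: float, i_n: float, c1: float, c2: float) -> bool:
    """Decide whether a word boundary exists at the position scored by E_n.

    e_n: joint score E_n, i_n: entropy value I_n,
    c1: threshold C_1 for I_n, c2: threshold C_2 for E_n.
    """
    if i_n >= c1:
        # E_n itself is not reliable, so no boundary is asserted
        return False
    # E_n is reliable: a small joint score means a boundary
    return e_n < c2

# Hypothetical thresholds and scores
print(has_word_boundary(0.1, 0.2, c1=0.5, c2=0.3))  # small E_n, small I_n
print(has_word_boundary(0.4, 0.2, c1=0.5, c2=0.3))  # large E_n, small I_n
print(has_word_boundary(0.1, 0.9, c1=0.5, c2=0.3))  # large I_n
```

Checking I_n first makes the "unreliable score" case explicit, so the value of E_n is consulted only when its reliability has been established.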

As described above, the learning data update unit 60 determines the word boundaries in the target character string based on the joint scores between the character strings included in the target character string and on the reliability score of each joint score.

Having determined whether each word boundary in the target character string exists, the learning data update unit 60 updates the part-of-speech-less word data (learning data) stored in the labeled data storage unit 90 based on the word boundary determination result.

Specifically, the labeled data storage unit 90 is provided with a first area, which stores the part-of-speech-less word data input by the user, and a second area, which stores each character string (each item of part-of-speech-less word data) obtained by dividing the target character string at the boundaries given by the determination result of the learning data update unit 60. The learning data update unit 60 adds each character string obtained by dividing the target character string at those boundaries to the second area. The joint score calculation unit 10 preferably ends up calculating the joint scores using the data (learning data) stored in the second area of the labeled data storage unit 90. This improves the accuracy of the part-of-speech-less word data stored in the labeled data storage unit 94, which in turn improves the accuracy of part-of-speech estimation by a part-of-speech estimation device (not shown).

Alternatively, the second area need not be provided in the labeled data storage unit 90. In that case, the learning data update unit 60 adds each character string (each item of part-of-speech-less word data) obtained by dividing the target character string at the boundaries given by the determination result if it is not already stored in the labeled data storage unit 90, and deletes each character string obtained by dividing the target character string at boundaries different from those given by the determination result if it is stored in the labeled data storage unit 90.

As described above, the learning data update unit 60 decides, based on the word boundary determination result, whether to reflect each word in the character string under determination in the learning data, and updates the learning data word by word. Alternatively, the learning data update unit 60 may decide, based on the word boundary determination result, whether to reflect the entire character string under determination in the learning data, and update the learning data character string by character string.

For example, when the word boundaries were determined based on the reliability scores alone (that is, when the extraction by the extraction unit 40 is extraction with word boundary information), the learning data update unit 60 calculates the average value I_AVE of the I_n values in one target character string. It then compares I_AVE with a predetermined threshold C_3 and, if I_AVE < C_3, reflects the entire target character string in the learning data. This is because, when I_AVE is small, the reliability (certainty) of E_n according to the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94 is high on average over the whole target character string. If I_AVE ≥ C_3, the learning data update unit 60 does not reflect the entire target character string in the learning data. This is because, when I_AVE is large, the reliability of E_n according to that word boundary information is not high on average over the whole target character string.

As another example, when the word boundaries were determined based on the joint scores and the reliability scores (that is, when the extraction by the extraction unit 40 is extraction without word boundary information), the learning data update unit 60 calculates the average value E_AVE of the E_n values and the average value I_AVE of the I_n values in one target character string. It may then compare E_AVE with a predetermined threshold C_4 and I_AVE with a predetermined threshold C_3 to decide whether the learning data needs to be updated. As one example, the learning data update unit 60 may reflect the entire target character string in the learning data when E_AVE ≥ C_4 and I_AVE < C_3. This is because, when E_AVE is large and I_AVE is small, it is highly reliable (certain) that the entire target character string forms a single unit.
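The two whole-string criteria described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and the sample values are hypothetical, and the second function implements just the single example condition given in the text (E_AVE ≥ C_4 and I_AVE < C_3).

```python
from statistics import mean

def reflect_by_entropy(entropies, c3):
    """Extraction with word boundary information: use I_AVE alone."""
    i_ave = mean(entropies)
    return i_ave < c3

def reflect_by_scores(joint_scores, entropies, c3, c4):
    """Extraction without word boundary information: use E_AVE and I_AVE."""
    e_ave = mean(joint_scores)
    i_ave = mean(entropies)
    return e_ave >= c4 and i_ave < c3

# Hypothetical values for one target character string
print(reflect_by_entropy([0.1, 0.2, 0.3], c3=0.5))
print(reflect_by_scores([0.8, 0.9], [0.1, 0.2], c3=0.5, c4=0.7))
```

Each function returns True when the entire target character string would be reflected in the learning data.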

Next, the operation of the word boundary determination device 1 is described. FIG. 5(a) is a flowchart showing an example of the operation up to the point where the word boundary existence probabilities are stored in the word boundary existence probability storage unit 96. FIG. 5(b) is a flowchart showing an example of the operation up to the point where the part-of-speech-less word data is stored in the labeled data storage unit 94. FIG. 5(c) is a flowchart showing an example of the operation up to the point where the labeled data storage unit 90 is updated.

In FIG. 5(a), the joint score calculation unit 10 calculates joint scores using the sentence data (learning data) stored in the labeled data storage unit 90 (step S10). Having calculated the joint scores, the joint score calculation unit 10 outputs them to the joint score storage unit 92. The word boundary existence probability calculation unit 30 refers to the joint scores stored in the joint score storage unit 92 and calculates a word boundary existence probability for each joint score or for each joint score group (step S20). The word boundary existence probability calculation unit 30 then stores the calculated word boundary existence probabilities in the word boundary existence probability storage unit 96. The flowchart shown in FIG. 5(a) then ends.
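Step S20 can be pictured with the following Python sketch for the per-group case. It rests on several assumptions not stated in this passage: that joint score groups are fixed-width ranges, that each training observation is a pair of a joint score and a known boundary label, and that the probabilities are the ratios A_G/Z_G and B_G/Z_G, where A_G and B_G count the observations in group G with and without a boundary and Z_G = A_G + B_G. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def boundary_probabilities(observations, bin_width):
    """Map each joint score group G to (A_G / Z_G, B_G / Z_G).

    observations: iterable of (joint_score, is_boundary) pairs.
    bin_width: width of each joint score range (an assumed grouping).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [A_G, B_G]
    for score, is_boundary in observations:
        group = int(score // bin_width)
        counts[group][0 if is_boundary else 1] += 1
    probs = {}
    for group, (a, b) in counts.items():
        z = a + b  # Z_G: total observations in the group
        probs[group] = (a / z, b / z)
    return probs

# Hypothetical labeled observations
observations = [(0.1, True), (0.2, True), (0.2, False), (0.6, False)]
probs = boundary_probabilities(observations, bin_width=0.25)
```

The resulting pairs play the role of the reliability scores P(E_n)_A and P(E_n)_B that are later looked up for each joint score.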

In FIG. 5(b), the word boundary estimation unit 20 estimates the word boundaries of an unknown character string from the unknown character string and the joint scores stored in the joint score storage unit 92, and extracts each word obtained by dividing the unknown character string at those word boundaries (step S110). Having extracted the words, the word boundary estimation unit 20 stores each word in the labeled data storage unit 94 as part-of-speech-less word data (step S120). The flowchart shown in FIG. 5(b) then ends.

In FIG. 5(c), the extraction unit 40 extracts a target character string from the labeled data storage unit 94 (step S210) and outputs it to the reliability score assigning unit 50. Having acquired the target character string, the reliability score assigning unit 50 assigns a reliability score to each joint score between the character strings included in the target character string (step S220). Specifically, as the reliability score of each joint score between the character strings included in the target character string, the reliability score assigning unit 50 assigns the word boundary existence probability that corresponds to that joint score and is stored in the word boundary existence probability storage unit 96. The reliability score assigning unit 50 then outputs the reliability score of each joint score to the learning data update unit 60.

Having acquired the reliability score of each joint score, the learning data update unit 60 determines the word boundaries in the target character string (step S230). Having determined whether each word boundary in the target character string exists, the learning data update unit 60 updates the part-of-speech-less word data (learning data) stored in the labeled data storage unit 90 based on the word boundary determination result (step S240). The flowchart shown in FIG. 5(c) then ends.

As described above, the word boundary determination device 1 according to the embodiment of the present invention determines word boundaries based on the reliability of the joint scores, and can therefore determine the word boundaries in a character string with high accuracy.

In addition, because the learning data (the part-of-speech-less word data stored in the labeled data storage unit 90) is updated based on the word boundary determination result, the reliability of the joint score values stored in the joint score storage unit 92 improves, the reliability of the word boundary estimation by the word boundary estimation unit 20 improves, and the reliability of the part-of-speech-less word data stored in the labeled data storage unit 94 improves. In other words, the accuracy of estimating the word boundaries of unknown words that arise during morphological analysis improves. As a result, the part of speech to be assigned to an unknown word can be estimated with high accuracy.

In general, in a word boundary estimation method based on supervised learning that uses labeled data as learning data, data whose labels have already been determined is learned (incorporated) recursively when semi-supervised learning is performed. If an incorrect label has been assigned to some data, however, that incorrectly labeled data is also learned recursively, which degrades the accuracy of word boundary estimation. By contrast, when the word boundary determination device 1 of the above embodiment is applied to such a method, the reliability of each joint score is evaluated using the reliability score, so only data with highly reliable (plausible) labels is learned recursively. The loss of accuracy in word boundary estimation under recursive learning can therefore be kept to a minimum.

In other words, because using only learning data entered manually by a user is inefficient, label-determined data has conventionally been learned recursively for the sake of efficiency. Recursive learning, however, has had the problem of degrading the accuracy of word boundary estimation. The word boundary determination device 1 addresses this problem by suppressing that loss of accuracy even when label-determined data is learned recursively, so word boundary estimation can be performed both efficiently and with high accuracy.

A program for executing each process of the word boundary determination device 1 according to an embodiment of the present invention may be recorded on a computer-readable recording medium, and the various processes of the word boundary determination device 1 described above may be performed by having a computer system read and execute the program recorded on that medium. The "computer system" referred to here may include an OS and hardware such as peripheral devices. When a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line. The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

The embodiment of the present invention has been described in detail above with reference to the drawings. The specific configuration is not limited to this embodiment, however, and also covers designs and the like that do not depart from the gist of the invention.

Description of symbols
1: Word boundary determination device
10: Joint score calculation unit
20: Word boundary estimation unit
30: Word boundary existence probability calculation unit
40: Extraction unit
50: Reliability score assigning unit
60: Learning data update unit
90: Labeled data storage unit (manual)
92: Joint score storage unit
94: Labeled data storage unit (machine)
96: Word boundary existence probability storage unit

Claims

A word boundary determination device comprising:
a word boundary existence probability storage unit that stores, for each joint score indicating the degree of connection between character strings or for each joint score group classified according to the range of the joint scores, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between the character strings;
a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and
a determination unit that determines word boundaries in a character string subject to word boundary determination using the reliability scores of the joint scores between the character strings included in that character string,
wherein the reliability score assigning unit
assigns, as the reliability score of the joint score between a given pair of character strings, the word boundary existence probability that corresponds to the joint score between those character strings and is stored in the word boundary existence probability storage unit, and
wherein the determination unit,
based on the reliability scores of the joint scores between the character strings included in a character string that is subject to word boundary determination and has information on word boundaries, and where, among the reliability scores of the joint score (E_n) between the n-th character strings, the probability that a word boundary exists is the reliability score P(E_n)_A and the probability that no word boundary exists is the reliability score P(E_n)_B, calculates an entropy value (I_n) for each joint score (E_n) using the reliability score P(E_n)_A and the reliability score P(E_n)_B, and
determines the presence or absence of a word boundary in the character string based on the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1).

A word boundary determination device comprising:
a word boundary existence probability storage unit that stores, for each joint score indicating the degree of connection between character strings or for each joint score group classified according to the range of the joint scores, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between the character strings;
a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and
a determination unit that determines word boundaries in a character string subject to word boundary determination using the reliability scores of the joint scores between the character strings included in that character string,
wherein the reliability score assigning unit
assigns, as the reliability score of the joint score between a given pair of character strings, the word boundary existence probability that corresponds to the joint score between those character strings and is stored in the word boundary existence probability storage unit, and
wherein the determination unit,
based on the joint scores between the character strings included in a character string that is subject to word boundary determination and has no information on word boundaries, and on the reliability scores of those joint scores, and where, among the reliability scores of the joint score (E_n) between the n-th character strings, the probability that a word boundary exists is the reliability score P(E_n)_A and the probability that no word boundary exists is the reliability score P(E_n)_B, calculates an entropy value (I_n) for each joint score (E_n) using the reliability score P(E_n)_A and the reliability score P(E_n)_B, and
determines the presence or absence of a word boundary in the character string based on the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1) and on the magnitude relationship between the joint score (E_n) and a predetermined threshold (C_2).

The word boundary determination device according to claim 1 or claim 2, further comprising a joint score calculation unit that calculates the joint score between a first character string and a second character string,
wherein the joint score calculation unit
calculates the joint score between the first character string and the second character string based on: a first appearance count, which is the number of times the second character string appears following the first character string in text; a second appearance count, which is the number of times a character string other than the second character string appears following the first character string in the text; a third appearance count, which is the number of times the second character string appears following a character string other than the first character string in the text; and a fourth appearance count, which is the number of times a character string other than the second character string appears following a character string other than the first character string in the text.