JP5566704B2 - Word boundary judgment device - Google Patents

Publication number
JP5566704B2
JP5566704B2 JP2010006049A JP2010006049A JP5566704B2 JP 5566704 B2 JP5566704 B2 JP 5566704B2 JP 2010006049 A JP2010006049 A JP 2010006049A JP 2010006049 A JP2010006049 A JP 2010006049A JP 5566704 B2 JP5566704 B2 JP 5566704B2
Authority
JP
Japan
Prior art keywords
score
character string
word boundary
joint
reliability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2010006049A
Other languages
Japanese (ja)
Other versions
JP2011145885A (en)
Inventor
正 柳原
一則 松本
康弘 滝嶋
和史 池田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KDDI Research Inc
Original Assignee
KDDI R&D Laboratories Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KDDI R&D Laboratories Inc
Priority to JP2010006049A
Publication of JP2011145885A
Application granted
Publication of JP5566704B2

Description

The present invention relates to a word boundary determination device.

In morphological analysis, character strings that cannot be identified as words (hereinafter, "unknown character strings") are often output. In general, any character string that is not registered in the dictionary (hereinafter, "morphological analysis dictionary") referenced by the main part of a morphological analyzer (hereinafter, "morphological analysis engine") is output as an unknown character string.

As a technique for correctly identifying words in a character string, one conceivable approach uses n-gram statistics to estimate the word boundaries within an unknown character string and then estimates the part of speech of each portion estimated to be a word (see Non-Patent Document 1). For example, the method in the paper of Non-Patent Document 1 uses n-gram statistics to generate words from a character string based on the degree of association between characters, which is derived from probabilities calculated from character occurrence frequencies. It then estimates the part of speech of each word by using a threshold. In addition, because the appropriate threshold often differs from one data set to another, the threshold is readjusted each time the input data is changed.

Shinsuke Mori, Makoto Nagao, "Unknown word extraction from corpus by n-gram statistics", Transactions of Information Processing Society of Japan, Vol.95, No.168, pp.7-12, 1998.
Kazunori Matsumoto, Kazuo Hashimoto, "Schema Design for Causal Law Mining from Incomplete Database", Discovery Science, Second International Conference, DS '99, Tokyo, Japan, December 1999, Proceedings. Lecture Notes in Computer Science 1721, Springer, pp.92-102, 1999.

However, the method in the paper of Non-Patent Document 1 has the following problems. Statistical information is expressed as probabilities, but converting counts to probabilities discards the reliability carried by the original amount of information. For example, a word that appears 10 times in 100 sentences is more reliable, in terms of the amount of information, than a word that appears once in 10 sentences; yet when probabilities are used, both are treated simply as probability 0.1, and that reliability is lost. Furthermore, Non-Patent Document 1 examines only the association between a given character string and the character that follows it, which is less accurate than also examining the association with the character that precedes it. Finally, using a threshold amounts to discriminating boundaries linearly, so for the sake of accuracy it is desirable to use a word boundary estimation method capable of nonlinear discrimination.

The present invention has been made in view of the above problems, and its object is to provide a technique for determining word boundaries in a character string with high reliability.

To solve the above problems, a word boundary determination device according to one aspect of the present invention comprises: a word boundary existence probability storage unit that stores, for each joint score indicating the degree of joining between character strings, or for each joint score group classified according to the range of the joint score, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between character strings; a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and a determination unit that determines the word boundaries in a character string subject to word boundary determination by using the reliability scores of the joint scores between the character strings contained in it. The reliability score assigning unit assigns, as the reliability score of the joint score between a given pair of character strings, the word boundary existence probability that is stored in the word boundary existence probability storage unit and corresponds to that joint score.

The word boundary determination device may further include a joint score calculation unit that calculates the joint score between a first character string and a second character string. The joint score calculation unit counts a first appearance count, the number of times the second character string appears immediately after the first character string in the text; a second appearance count, the number of times a character string other than the second character string appears immediately after the first character string; a third appearance count, the number of times the second character string appears immediately after a character string other than the first character string; and a fourth appearance count, the number of times a character string other than the second character string appears immediately after a character string other than the first character string. It then calculates the joint score between the first character string and the second character string based on these four counts.

Specifically, letting a be the first appearance count, b the second appearance count, c the third appearance count, d the fourth appearance count, h = a + b, k = a + c, and n = a + b + c + d, the joint score calculation unit may calculate a first score and a second score according to the following formulas, and may calculate the difference between the first score and the second score as the joint score between the first character string and the second character string.
(Formula for the first score)
First score = −2 × {h log h + k log k + (n − h) log(n − h) + (n − k) log(n − k) − 2n log n} + 2 × 3
(Formula for the second score)
Second score = −2 × {a log a + b log b + c log c + d log d − n log n} + 2 × 2
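The two formulas above can be evaluated directly from the four counts. The following is a minimal sketch, not the patented implementation: the function names are hypothetical, the logarithm is assumed to be the natural logarithm, and 0 log 0 is assumed to be 0 so that empty cells do not raise an error. Neither assumption is stated in the text.

```python
import math


def xlogx(x: float) -> float:
    """Return x * log(x); 0 * log(0) is treated as 0 (an assumption)."""
    return x * math.log(x) if x > 0 else 0.0


def first_score(a: int, b: int, c: int, d: int) -> float:
    """First score, written term by term as in the formula above."""
    h, k, n = a + b, a + c, a + b + c + d
    inner = xlogx(h) + xlogx(k) + xlogx(n - h) + xlogx(n - k) - 2 * xlogx(n)
    return -2 * inner + 2 * 3


def second_score(a: int, b: int, c: int, d: int) -> float:
    """Second score, written term by term as in the formula above."""
    n = a + b + c + d
    inner = xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) - xlogx(n)
    return -2 * inner + 2 * 2


def score_difference(a: int, b: int, c: int, d: int) -> float:
    """Difference between the first and second scores."""
    return first_score(a, b, c, d) - second_score(a, b, c, d)
```

With a perfectly balanced table (a = b = c = d), the logarithmic terms cancel and the difference reduces to the constant terms, 2 × 3 − 2 × 2 = 2. The more the counts concentrate on the diagonal, the larger the difference becomes.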

For a target character string that carries word boundary information, the determination unit may work from the reliability scores of the joint scores between the character strings contained in it. Let P(E_n)_A denote the reliability score giving the probability that a word boundary exists for the n-th joint score E_n, and let P(E_n)_B denote the reliability score giving the probability that no word boundary exists. The determination unit calculates an entropy value I_n for each joint score E_n from P(E_n)_A and P(E_n)_B, and determines whether a word boundary exists in the character string based on the magnitude relationship between each calculated entropy value I_n and a predetermined threshold C_1.

For a target character string that carries no word boundary information, the determination unit may work from the joint scores between the characters contained in it and from the reliability scores of those joint scores. Let P(E_n)_A denote the reliability score giving the probability that a word boundary exists for the n-th joint score E_n, and let P(E_n)_B denote the reliability score giving the probability that no word boundary exists. The determination unit calculates an entropy value I_n for each joint score E_n from P(E_n)_A and P(E_n)_B, and determines whether a word boundary exists in the character string based on both the magnitude relationship between each calculated entropy value I_n and a predetermined threshold C_1 and the magnitude relationship between the joint score E_n and a predetermined threshold C_2.
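This passage does not give the formula for the entropy value I_n. The sketch below assumes the standard Shannon entropy of a two-outcome distribution, in bits, computed from P(E_n)_A and P(E_n)_B. The function names are hypothetical. How the result of the comparison with C_1 (and C_2) maps to "boundary" or "no boundary" is not stated here, so the sketch only reports whether the entropy is below the threshold.

```python
import math


def binary_entropy(p_boundary: float, p_no_boundary: float) -> float:
    """Shannon entropy in bits of the two reliability scores (assumed formula).

    Terms with probability 0 contribute nothing, by the usual convention.
    """
    return -sum(p * math.log2(p) for p in (p_boundary, p_no_boundary) if p > 0)


def entropy_below_threshold(p_boundary: float, p_no_boundary: float, c1: float) -> bool:
    """Compare the entropy value I_n with the threshold C_1."""
    return binary_entropy(p_boundary, p_no_boundary) < c1
```

Under this assumption, an entropy near 0 means one outcome dominates (the stored probabilities are decisive), while an entropy of 1 means the two outcomes are equally likely.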

The device may further include an update unit that updates the learning data based on the word boundary determination results.

The device may include an update unit that updates the learning data based on the word boundary determination results. The update unit may compare the average I_AVE of the entropy values I_n within one target character string with a predetermined threshold C_3 and, when I_AVE < C_3, add the target character string to the learning data. Alternatively, the update unit may compare the average E_AVE of the joint scores E_n within one target character string with a predetermined threshold C_4, compare the average I_AVE of the entropy values I_n within the target character string with the predetermined threshold C_3, and, when E_AVE ≥ C_4 and I_AVE < C_3, add the target character string to the learning data.
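The two update conditions can be written as a single check. This is a minimal sketch with a hypothetical function name and signature: when no joint scores are supplied, only the entropy condition I_AVE < C_3 is tested; otherwise both E_AVE ≥ C_4 and I_AVE < C_3 must hold.

```python
from typing import Optional, Sequence


def should_add_to_learning_data(
    entropies: Sequence[float],
    c3: float,
    joint_scores: Optional[Sequence[float]] = None,
    c4: Optional[float] = None,
) -> bool:
    """Decide whether a target character string is added to the learning data."""
    i_ave = sum(entropies) / len(entropies)  # average entropy value I_AVE
    if joint_scores is None or c4 is None:
        return i_ave < c3  # first variant: entropy condition only
    e_ave = sum(joint_scores) / len(joint_scores)  # average joint score E_AVE
    return e_ave >= c4 and i_ave < c3  # second variant: both conditions
```

Note that the comparison with C_3 is strict while the comparison with C_4 is not, as in the text.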

According to the present invention, word boundaries in a character string can be determined with high reliability.

An example functional block diagram of the word boundary determination device according to the first embodiment of the present invention.
A diagram explaining how the joint score calculation unit generates joint scores.
An example of the information stored in the joint score storage unit.
An example of the information stored in the word boundary existence probability storage unit.
A flowchart showing an example of the operation of the word boundary determination device.

A first embodiment of the present invention is described in detail below with reference to the drawings. As shown in FIG. 1, the word boundary determination device 1 according to the first embodiment includes a joint score calculation unit 10, a word boundary estimation unit 20, a word boundary existence probability calculation unit 30, an extraction unit 40, a reliability score assigning unit 50, a learning data update unit 60, a labeled data storage unit 90, a joint score storage unit 92, a labeled data storage unit 94, and a word boundary existence probability storage unit 96.

The labeled data storage unit 90 stores text data that includes word boundaries. The text data stored in the labeled data storage unit 90 is word data without part-of-speech tags, entered by a user as learning data. This text data preferably contains many unknown character strings.

The joint score calculation unit 10 calculates joint scores using the text data (learning data) stored in the labeled data storage unit 90. A joint score is an index indicating the degree of joining between character strings. It is calculated for each character string (consisting of one or more characters) contained in the text given as learning data, by tallying the distribution of the characters that appear before and after that character string in the text. The joint score is larger the more often no word boundary holds between a character string and the adjacent character string in the text. In other words, the larger the joint score, the less likely it is that a word boundary exists between the two character strings.

The joint score calculation function of the joint score calculation unit 10 is described in detail below. For each pair consisting of one character string in the text and an appearing character string that appears before or after it, the joint score calculation unit 10 counts the number of times the appearing character string appears in the text, and calculates the joint score between the one character string and the appearing character string based on the counts for each pair. Specifically, the joint score calculation unit 10 measures the degree of association (degree of joining) between characters (or character strings) using an evaluation method based on model testing.

Specifically, to calculate the joint score between a first character string and a second character string, the joint score calculation unit 10 counts a first appearance count, the number of times the second character string appears immediately after the first character string in the text; a second appearance count, the number of times a character string other than the second character string appears immediately after the first character string; a third appearance count, the number of times the second character string appears immediately after a character string other than the first character string; and a fourth appearance count, the number of times a character string other than the second character string appears immediately after a character string other than the first character string. It then calculates the joint score between the first character string and the second character string based on these four counts.

More specifically, the joint score calculation unit 10 counts the appearance counts a11, a12, a21, and a22 for each pair of a k-string and a v-string, as shown in FIG. 2(a). The "k-string" is an N-gram and corresponds to the "first character string" above. The "v-string" is the character string for which it is to be determined whether it should be joined to the k-string, and corresponds to the "second character string" above. That is, a pair of a k-string and a v-string corresponds to a pair consisting of the first character string and the second character string. The same applies to FIG. 2(b).

"a11" is the number of times the v-string appears adjacent to the k-string in the text. That is, a11 corresponds to the first appearance count, the number of times the second character string appears immediately after the first character string in the text.
For example, with k-string "旧" and v-string "姓", if the character string "旧姓" (maiden name) appears once in the text data (learning data) stored in the labeled data storage unit 90, then a11 = 1, as shown in FIG. 2(a).
Note that, for a11, the first character string and the second character string correspond to the one character string and the appearing character string.

"a12" is the number of times the v-string does not appear adjacent to the k-string in the text; in other words, the number of times any character other than the v-string appears adjacent to the k-string. That is, a12 corresponds to the second appearance count, the number of times a character string other than the second character string appears immediately after the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "旧暦" (old calendar) and "旧モ" appear a total of 300 times in the text data (learning data) stored in the labeled data storage unit 90, then a12 = 300, as shown in FIG. 2(a). The character string "旧モ" is, for example, part of the character string "旧モデル" (old model).
Note that, for a12, the first character string and the character string other than the second character string correspond to the one character string and the appearing character string.

"a21" is the number of times the v-string is not adjacent to the k-string in the text; in other words, the number of times the v-string appears adjacent to any character string other than the k-string. That is, a21 corresponds to the third appearance count, the number of times the second character string appears immediately after a character string other than the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "の姓" and "(姓" appear a total of once in the text data (learning data) stored in the labeled data storage unit 90, then a21 = 1, as shown in FIG. 2(a). The character string "(姓" is, for example, part of the character string "氏(姓)".
Note that, for a21, the character string other than the first character string and the second character string correspond to the one character string and the appearing character string.

"a22" is the number of cases in the text that involve neither the k-string nor the v-string; in other words, the number of times any character string other than the k-string appears adjacent to any character other than the v-string. That is, a22 corresponds to the fourth appearance count, the number of times a character string other than the second character string appears immediately after a character string other than the first character string in the text.
For example, with k-string "旧" and v-string "姓", if character strings such as "私は" and "明日" appear a total of 300 times in the text data (learning data) stored in the labeled data storage unit 90, then a22 = 300, as shown in FIG. 2(a).
Note that, for a22, the character string other than the first character string and the character string other than the second character string correspond to the one character string and the appearing character string.
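The four counts described above can be tallied in one pass over the text. The sketch below is an illustration under stated assumptions, not the device's actual procedure: the function name is hypothetical, every text position is treated as one candidate pair, and the left and right strings are taken to have the same lengths as the k-string and v-string.

```python
def count_pairs(text: str, k_string: str, v_string: str) -> tuple:
    """Tally (a11, a12, a21, a22) for one (k-string, v-string) pair."""
    a11 = a12 = a21 = a22 = 0
    lk, lv = len(k_string), len(v_string)
    for i in range(len(text) - lk - lv + 1):
        left = text[i:i + lk]              # candidate first character string
        right = text[i + lk:i + lk + lv]   # string that immediately follows it
        if left == k_string and right == v_string:
            a11 += 1  # k-string followed by v-string
        elif left == k_string:
            a12 += 1  # k-string followed by something else
        elif right == v_string:
            a21 += 1  # something else followed by v-string
        else:
            a22 += 1  # neither k-string nor v-string involved
    return a11, a12, a21, a22


# "旧姓" counts toward a11, "旧暦" toward a12, "の姓" toward a21.
print(count_pairs("旧姓は旧暦の姓", "旧", "姓"))  # (1, 1, 1, 3)
```

The seven-character sample text yields six adjacent pairs: 旧姓 (a11), 姓は, は旧 and 暦の (a22), 旧暦 (a12), and の姓 (a21).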

Having tallied the appearance counts a11, a12, a21, and a22 for a pair, the joint score calculation unit 10 calculates the joint score (labeled "score" in FIG. 2) between the k-string and the v-string of that pair based on those counts. For example, as shown in FIG. 2(b), the joint score calculation unit 10 calculates a score of 0.33 between the k-string "旧" and the v-string "姓" based on the appearance counts a11, a12, a21, and a22 of that pair.

In FIG. 2(b), "aic(IM)" is a score calculated on the assumption that a11, a12, a21, and a22 are independent phenomena. Specifically, letting h = a11 + a12, k = a11 + a21, and n = a11 + a12 + a21 + a22, it is calculated by the following equation (1).

aic(IM) = −2 × {h log h + k log k + (n − h) log(n − h) + (n − k) log(n − k) − 2n log n} + 2 × 3   (1)

In FIG. 2(b), "aic(DM)" is a score calculated on the assumption that a11, a12, a21, and a22 are independent phenomena. Specifically, letting a = a11, b = a12, c = a21, d = a22, and n = a11 + a12 + a21 + a22, it is calculated by the following equation (2).

aic(DM) = −2 × {a log a + b log b + c log c + d log d − n log n} + 2 × 2   (2)

The joint score is calculated from aic(IM) and aic(DM). Specifically, when a11/(a11 + a12) > a21/(a21 + a22), it is calculated by equation (3) below; when a11/(a11 + a12) < a21/(a21 + a22), it is calculated by equation (4) below.

[Equations (3) and (4): image "Figure 0005566704" in the original publication]

Having calculated the joint score, the joint score calculation unit 10 outputs it to the joint score storage unit 92. For example, as shown in FIG. 2(c), the joint score calculation unit 10 outputs the joint score to the joint score storage unit 92 in association with the pair (k-string, v-string).

The joint score storage unit 92 stores the joint scores output from the joint score calculation unit 10. For example, as shown in FIG. 3, the joint score storage unit 92 stores each joint score in association with its pair (k-string, v-string). In the example of FIG. 3, the joint score storage unit 92 stores the score 0.33 in association with the pair of k-string "旧" and v-string "姓". FIG. 3 shows the joint scores for the sentence "旧姓は中野。" ("The maiden name is Nakano."), but the joint score values for pairs other than (旧, 姓) are omitted.

The word boundary estimation unit 20 uses the joint scores generated by the joint score calculation unit 10 (that is, the joint scores stored in the joint score storage unit 92) and an unknown character string stored in an unknown character string storage device (not shown) to estimate the word boundaries at which the unknown character string is to be divided into words. Having estimated the word boundaries, the word boundary estimation unit 20 divides the unknown character string at those boundaries and extracts each resulting word. It then stores each extracted word in the labeled data storage unit 94 as word data without part-of-speech tags.

The labeled data storage unit 94 stores the word data without part-of-speech tags output from the word boundary estimation unit 20. That is, whereas the word data stored in the labeled data storage unit 90 is entered by a user, the word data stored in the labeled data storage unit 94 is output mechanically (by the word boundary estimation unit 20). The word data stored in the labeled data storage unit 94 is used for part-of-speech estimation by a part-of-speech estimation device (not shown).

The word boundary existence probability calculation unit 30 refers to the joint scores stored in the joint score storage unit 92 and calculates a word boundary existence probability for each joint score. The word boundary existence probability is a probability representing whether or not a word boundary exists (holds) between character strings.

For example, the word boundary existence probability calculation unit 30 calculates, for each joint score E_X (X = 1, 2, ..., n), the number of cases Z_X, that is, the number of pairs Z_X, and, among those Z_X cases, the number of times A_X that a word boundary existed and the number of times B_X that no word boundary existed. Then, for each joint score E_X, it calculates as the word boundary existence probability the probability A_X/Z_X that a word boundary exists and the probability B_X/Z_X that no word boundary exists. That is, it calculates A_1/Z_1 and B_1/Z_1 as the word boundary existence probability of joint score E_1, A_2/Z_2 and B_2/Z_2 as that of joint score E_2, ..., and A_n/Z_n and B_n/Z_n as that of joint score E_n. The word boundary existence probability calculation unit 30 may calculate only one of A_X/Z_X and B_X/Z_X.
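The per-score tally described above amounts to counting, for each distinct joint score, how many labeled cases had a boundary and how many did not. The following minimal sketch assumes the labeled cases are available as (joint score, boundary flag) pairs; that input format and the function name are hypothetical.

```python
from collections import defaultdict


def boundary_probabilities(cases):
    """Map each joint score E_X to (A_X / Z_X, B_X / Z_X).

    cases: iterable of (joint_score, has_boundary) pairs from labeled data.
    """
    counts = defaultdict(lambda: [0, 0])  # joint score -> [A_X, B_X]
    for score, has_boundary in cases:
        counts[score][0 if has_boundary else 1] += 1
    # Z_X = A_X + B_X is the number of cases with that joint score.
    return {s: (a / (a + b), b / (a + b)) for s, (a, b) in counts.items()}


cases = [(0.33, False), (0.33, False), (0.33, True), (-1.2, True)]
print(boundary_probabilities(cases)[-1.2])  # (1.0, 0.0)
```

A score that occurs only once, such as -1.2 here, gets a probability of exactly 0 or 1, which illustrates why sparsely observed scores give unreliable estimates.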

Alternatively, for example, the word boundary existence probability calculation unit 30 may calculate the word boundary existence probability (A_GX/Z_GX, B_GX/Z_GX) not for each joint score but for each joint score group E_GX (X = 1, 2, ..., m), where the groups are formed according to ranges of joint score values. In other words, the word boundary existence probability calculation unit 30 may group joint scores that are close to one another (E_G1, E_G2, ...) and calculate the word boundary existence probability for each group. That is, it calculates A_G1/Z_G1 and B_G1/Z_G1 as the word boundary existence probability of the joint score group E_G1, A_G2/Z_G2 and B_G2/Z_G2 as that of the joint score group E_G2, and so on, up to A_Gm/Z_Gm and B_Gm/Z_Gm as that of the joint score group E_Gm. Note that the word boundary existence probability calculation unit 30 may calculate only one of A_GX/Z_GX and B_GX/Z_GX. Because the number of cases Z_GX is larger than the number of cases Z_X, calculating the word boundary existence probability for each joint score group solves the problem that a reasonable word boundary existence probability cannot be obtained when the number of cases Z_X is extremely small.
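A grouped variant can be sketched in the same way. The source does not state how the score ranges are chosen, so the explicit list of range edges (`group_edges`) and the function name are assumptions made for illustration only.

```python
import bisect
from collections import defaultdict

def grouped_boundary_probabilities(observations, group_edges):
    """Return {group index: (A_GX / Z_GX, B_GX / Z_GX)}.

    group_edges: sorted upper edges separating the score ranges
    (hypothetical; the source only says scores are grouped by range).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [A_GX, B_GX]
    for score, has_boundary in observations:
        group = bisect.bisect_right(group_edges, score)
        counts[group][0 if has_boundary else 1] += 1
    return {g: (a / (a + b), b / (a + b)) for g, (a, b) in counts.items()}

observations = [(0.10, True), (0.20, False), (0.22, True),
                (0.60, False), (0.90, False)]
grouped = grouped_boundary_probabilities(observations, [0.25, 0.5, 0.75])
```

The three scores below 0.25 each occur only once, so a per-score probability would be 0 or 1; pooled into one group they give Z = 3 and a less extreme estimate.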

The word boundary existence probability calculation unit 30 stores the word boundary existence probability calculated for each joint score or for each joint score group in the word boundary existence probability storage unit 96.

The word boundary existence probability storage unit 96 stores the word boundary existence probability output from the word boundary existence probability calculation unit 30. For example, the word boundary existence probability storage unit 96 stores the word boundary existence probability for each joint score, as shown in FIG. 4(a). The word boundary existence probability storage unit 96 may also store the word boundary existence probability for each joint score group, as shown in FIG. 4(b).

In response to a request from the outside or from the reliability score assigning unit 50, the extraction unit 40 extracts, from the labeled data storage unit 94, a character string whose word boundaries are to be determined (hereinafter referred to as a "target character string"). Specifically, the extraction unit 40 extracts, as the target character string, a character string having information on word boundaries (a character string that retains its label information) from the labeled data storage unit 94 (hereinafter, this extraction mode is referred to as "extraction with word boundaries"). The extraction unit 40 may instead extract, as the target character string, a character string having no information on word boundaries (a character string whose label information has been discarded) from the labeled data storage unit 94 (hereinafter, this extraction mode is referred to as "extraction without word boundaries"). Whether the extraction unit 40 performs extraction with word boundaries or extraction without word boundaries may be fixed in advance in the word boundary determination device 1, or may be switched in response to an external input.

The extraction unit 40 outputs the extracted target character string (a character string having information on word boundaries, or a character string having no information on word boundaries) to the reliability score assigning unit 50.

The reliability score assigning unit 50 acquires the target character string from the extraction unit 40. Having acquired the target character string, the reliability score assigning unit 50 assigns a reliability score to each joint score between the character strings included in the target character string. Specifically, as the reliability score of the joint score between a given pair of character strings, the reliability score assigning unit 50 assigns the word boundary existence probability that is stored in the word boundary existence probability storage unit 96 for that joint score. The reliability score is an index indicating the reliability of each joint score.

The reliability score assigning function of the reliability score assigning unit 50 is described in detail below, separately for each unit in which the word boundary existence probability calculation unit 30 calculates the word boundary existence probability (per joint score or per joint score group) and for each extraction mode of the extraction unit 40.

(Word boundary existence probability calculated per joint score, with extraction with word boundaries)
The following description takes as an example the case where the reliability score assigning unit 50 acquires from the extraction unit 40 the target character string "/すぐ/行く/!/", which has information on word boundaries ("/" denotes the word boundary information).

Having acquired the target character string "/すぐ/行く/!/", the reliability score assigning unit 50 acquires from the joint score storage unit 92 each joint score between the character strings included in the target character string. That is, the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score E_1 between the first character strings ((no character)/すぐ), the joint score E_2 between the second character strings (すぐ/行く), the joint score E_3 between the third character strings (行く/!), and the joint score E_4 between the fourth character strings (!/(no character)).

Having acquired the joint scores between the character strings, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities corresponding to the joint scores E_1, E_2, E_3, and E_4 between the first to fourth character strings, and assigns them to the joint scores E_1, E_2, E_3, and E_4 as reliability scores.

That is, the word boundary existence probability storage unit 96 stores a word boundary existence probability for each joint score. For example, when A_1/Z_1 and B_1/Z_1 are stored as the word boundary existence probability of E_1, the reliability score assigning unit 50 assigns A_1/Z_1 and B_1/Z_1 to the joint score E_1 between the first character strings; when A_2/Z_2 and B_2/Z_2 are stored as the word boundary existence probability of E_2, it assigns A_2/Z_2 and B_2/Z_2 to the joint score E_2 between the second character strings; and so on, until, when A_4/Z_4 and B_4/Z_4 are stored as the word boundary existence probability of E_4, it assigns A_4/Z_4 and B_4/Z_4 to the joint score E_4 between the fourth character strings.
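The per-score assignment amounts to a table lookup, sketched below. The names `assign_reliability` and `probability_table`, the numeric values, and the choice to return `None` when no probability is stored are all assumptions made for illustration; the source only describes the case where a probability is stored.

```python
def assign_reliability(joint_scores, probability_table):
    """Pair each joint score E_n with its stored (A_n / Z_n, B_n / Z_n).

    probability_table: {joint score: (P(boundary), P(no boundary))},
    standing in for the word boundary existence probability storage unit 96.
    Scores with no stored probability get None (an assumption of this sketch).
    """
    return [(score, probability_table.get(score)) for score in joint_scores]

# Hypothetical stored probabilities and joint scores E_1..E_4
# for the four gaps of "/すぐ/行く/!/".
probability_table = {0.05: (0.95, 0.05), 0.1: (0.9, 0.1),
                     0.2: (0.8, 0.2), 0.3: (0.7, 0.3)}
joint_scores = [0.1, 0.2, 0.3, 0.05]
scored = assign_reliability(joint_scores, probability_table)
```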

(Word boundary existence probability calculated per joint score group, with extraction with word boundaries)
As above, the following description takes as an example the case where the reliability score assigning unit 50 acquires the target character string "/すぐ/行く/!/".

Having acquired the target character string "/すぐ/行く/!/", the reliability score assigning unit 50 acquires from the joint score storage unit 92, as in the case of calculation per joint score with extraction with word boundaries, the joint score E_1 between the first character strings ((no character)/すぐ), the joint score E_2 between the second character strings (すぐ/行く), the joint score E_3 between the third character strings (行く/!), and the joint score E_4 between the fourth character strings (!/(no character)).

Having acquired the joint scores between the character strings, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities of the joint score groups that contain the joint scores E_1, E_2, E_3, and E_4 between the first to fourth character strings, and assigns them to the joint scores E_1, E_2, E_3, and E_4 as reliability scores.

That is, the word boundary existence probability storage unit 96 stores a word boundary existence probability for each joint score group. For example, when A_G1/Z_G1 and B_G1/Z_G1 are stored as the word boundary existence probability of the joint score group E_G1 containing E_1, the reliability score assigning unit 50 assigns A_G1/Z_G1 and B_G1/Z_G1 to the joint score E_1 between the first character strings; when A_G2/Z_G2 and B_G2/Z_G2 are stored as the word boundary existence probability of the joint score group E_G2 containing E_2, it assigns A_G2/Z_G2 and B_G2/Z_G2 to the joint score E_2 between the second character strings; and so on, until, when A_G4/Z_G4 and B_G4/Z_G4 are stored as the word boundary existence probability of the joint score group E_G4 containing E_4, it assigns A_G4/Z_G4 and B_G4/Z_G4 to the joint score E_4 between the fourth character strings.
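With group-level probabilities, the lookup first finds the group whose score range contains E_n. The sketch below assumes the groups are defined by sorted range edges; `group_edges`, `group_table`, and all numeric values are hypothetical.

```python
import bisect

def assign_group_reliability(joint_scores, group_edges, group_table):
    """Give each joint score the (A_G / Z_G, B_G / Z_G) of its group."""
    result = []
    for score in joint_scores:
        # Index of the score range (joint score group) containing this score.
        group = bisect.bisect_right(group_edges, score)
        result.append(group_table.get(group))
    return result

group_edges = [0.25, 0.5, 0.75]  # hypothetical range edges
group_table = {0: (0.9, 0.1), 1: (0.6, 0.4), 2: (0.3, 0.7), 3: (0.05, 0.95)}
scores = [0.10, 0.20, 0.60, 0.90]  # hypothetical E_1..E_4
reliability = assign_group_reliability(scores, group_edges, group_table)
```

Note that two joint scores falling in the same range (0.10 and 0.20 here) receive the same reliability score.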

(Word boundary existence probability calculated per joint score, with extraction without word boundaries)
The following description takes as an example the case where the reliability score assigning unit 50 acquires from the extraction unit 40 the target character string "すぐ行く!", which has no information on word boundaries.

Having acquired the target character string "すぐ行く!", the reliability score assigning unit 50 acquires from the joint score storage unit 92 each joint score between the characters included in the target character string. That is, the reliability score assigning unit 50 acquires from the joint score storage unit 92 the joint score E_1' between the first characters ((no character)/す), the joint score E_2' between the second characters (す/ぐ), the joint score E_3' between the third characters (ぐ/行), the joint score E_4' between the fourth characters (行/く), the joint score E_5' between the fifth characters (く/!), and the joint score E_6' between the sixth characters (!/(no character)). Note that the joint score E_6' between the sixth characters (!/(no character)) is identical to the joint score E_4 between the fourth character strings (!/(no character)) obtained when the target character string "/すぐ/行く/!/" is acquired.
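The difference between the two extraction modes is which positions receive a joint score. The sketch below enumerates those positions for both forms of the example string; the helper names `split_units` and `gaps`, and the use of an empty string for "(no character)", are assumptions of this illustration.

```python
def split_units(target, has_boundaries):
    """Split into labeled strings (with '/') or single characters (without)."""
    if has_boundaries:
        return [unit for unit in target.split("/") if unit]
    return list(target)

def gaps(units):
    """Return (left, right) for every scored position, including both ends.

    The empty string stands for "(no character)".
    """
    padded = [""] + list(units) + [""]
    return list(zip(padded, padded[1:]))

with_boundaries = gaps(split_units("/すぐ/行く/!/", True))
without_boundaries = gaps(split_units("すぐ行く!", False))
```

The labeled form yields four positions (E_1 to E_4) and the unlabeled form yields six (E_1' to E_6'); the last position is the same pair in both cases, consistent with E_6' being identical to E_4.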

Having acquired the joint scores between the characters, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities corresponding to the first to sixth joint scores E_1', E_2', E_3', E_4', E_5', and E_6', and assigns them to the respective joint scores as reliability scores.

That is, the word boundary existence probability storage unit 96 stores a word boundary existence probability for each joint score. For example, when A_1'/Z_1' and B_1'/Z_1' are stored as the word boundary existence probability of E_1', the reliability score assigning unit 50 assigns A_1'/Z_1' and B_1'/Z_1' to the first joint score E_1'; when A_2'/Z_2' and B_2'/Z_2' are stored as the word boundary existence probability of E_2', it assigns A_2'/Z_2' and B_2'/Z_2' to the second joint score E_2'; and so on, until, when A_6'/Z_6' and B_6'/Z_6' are stored as the word boundary existence probability of E_6', it assigns A_6'/Z_6' and B_6'/Z_6' to the sixth joint score E_6'.

(Word boundary existence probability calculated per joint score group, with extraction without word boundaries)
As above, the following description takes as an example the case where the reliability score assigning unit 50 acquires the target character string "すぐ行く!".

Having acquired the target character string "すぐ行く!", the reliability score assigning unit 50 acquires from the joint score storage unit 92, as in the case of calculation per joint score with extraction without word boundaries, the joint score E_1' between the first characters ((no character)/す), the joint score E_2' between the second characters (す/ぐ), the joint score E_3' between the third characters (ぐ/行), the joint score E_4' between the fourth characters (行/く), the joint score E_5' between the fifth characters (く/!), and the joint score E_6' between the sixth characters (!/(no character)).

Having acquired the joint scores between the characters, the reliability score assigning unit 50 acquires from the word boundary existence probability storage unit 96 the word boundary existence probabilities of the joint score groups that contain the first to sixth joint scores E_1', E_2', E_3', E_4', E_5', and E_6', and assigns them to the joint scores E_1', E_2', E_3', E_4', E_5', and E_6' as reliability scores.

That is, the word boundary existence probability storage unit 96 stores a word boundary existence probability for each joint score group. For example, when A_G1'/Z_G1' and B_G1'/Z_G1' are stored as the word boundary existence probability of the joint score group E_G1' containing E_1', the reliability score assigning unit 50 assigns A_G1'/Z_G1' and B_G1'/Z_G1' to the first joint score E_1'; when A_G2'/Z_G2' and B_G2'/Z_G2' are stored as the word boundary existence probability of the joint score group E_G2' containing E_2', it assigns A_G2'/Z_G2' and B_G2'/Z_G2' to the second joint score E_2'; and so on, until, when A_G6'/Z_G6' and B_G6'/Z_G6' are stored as the word boundary existence probability of the joint score group E_G6' containing E_6', it assigns A_G6'/Z_G6' and B_G6'/Z_G6' to the sixth joint score E_6'.

Having assigned a reliability score to each joint score, the reliability score assigning unit 50 outputs the reliability scores of the joint scores to the learning data update unit 60. Specifically, when the extraction by the extraction unit 40 is extraction with word boundaries, that is, when the reliability score assigning unit 50 has acquired from the extraction unit 40 a target character string having information on word boundaries and assigned reliability scores, it outputs to the learning data update unit 60 the reliability score of each joint score between the character strings included in the target character string. On the other hand, when the extraction by the extraction unit 40 is extraction without word boundaries, that is, when the reliability score assigning unit 50 has acquired from the extraction unit 40 a target character string having no information on word boundaries and assigned reliability scores, it outputs to the learning data update unit 60 both the joint scores between the character strings included in the target character string and the reliability score of each joint score.

The learning data update unit 60 acquires from the reliability score assigning unit 50 the reliability score of each joint score between the character strings included in the target character string. Specifically, when the extraction by the extraction unit 40 is extraction with word boundaries, the learning data update unit 60 acquires from the reliability score assigning unit 50 the reliability score of each joint score between the character strings included in the target character string. On the other hand, when the extraction by the extraction unit 40 is extraction without word boundaries, the learning data update unit 60 acquires from the reliability score assigning unit 50 both the joint scores between the character strings included in the target character string and the reliability score of each joint score.

Having acquired the reliability score of each joint score, the learning data update unit 60 determines the word boundaries in the target character string. Specifically, when the learning data update unit 60 has acquired from the reliability score assigning unit 50 the reliability score of each joint score between the character strings included in the target character string (and has not acquired the joint scores themselves), it determines the word boundaries in the target character string based on the reliability scores of the joint scores. In other words, when the extraction by the extraction unit 40 is extraction with word boundaries, the learning data update unit 60 determines the word boundaries in the target character string based only on the reliability scores of the joint scores between the character strings.

On the other hand, when the learning data update unit 60 has acquired from the reliability score assigning unit 50 both the joint scores between the character strings included in the target character string and the reliability score of each joint score (that is, when it has also acquired the joint scores themselves), it determines the word boundaries in the target character string based on the joint scores and their reliability scores. In other words, when the extraction by the extraction unit 40 is extraction without word boundaries, the learning data update unit 60 determines the word boundaries in the target character string based on both the joint scores between the character strings and the reliability scores.

The word boundary determination function of the learning data update unit 60 is described in detail below for the case where word boundaries are determined based only on the reliability scores and for the case where they are determined based on the joint scores and the reliability scores.

(Determining word boundaries based only on the reliability scores)
The following description takes as an example the case where the learning data update unit 60 acquires, for the target character string "/すぐ/行く/!/", the reliability score (A_G1/Z_G1 and B_G1/Z_G1) of the joint score E_1 between the first character strings ((no character)/すぐ), the reliability score (A_G2/Z_G2 and B_G2/Z_G2) of the joint score E_2 between the second character strings (すぐ/行く), the reliability score (A_G3/Z_G3 and B_G3/Z_G3) of the joint score E_3 between the third character strings (行く/!), and the reliability score (A_G4/Z_G4 and B_G4/Z_G4) of the joint score E_4 between the fourth character strings (!/(no character)).
For convenience of explanation, of the reliability score of the joint score E_1, the probability that a word boundary exists (A_G1/Z_G1 in the above example) is written as the reliability score P(E_1)_A, and the probability that no word boundary exists (B_G1/Z_G1 in the above example) is written as the reliability score P(E_1)_B. Likewise, of the reliability score of the joint score E_2, the probability that a word boundary exists (A_G2/Z_G2 in the above example) is written as P(E_2)_A, and the probability that no word boundary exists (B_G2/Z_G2 in the above example) is written as P(E_2)_B. The same notation is used for the joint scores E_3 and E_4.

Having acquired P(E_1)_A, P(E_1)_B, P(E_2)_A, P(E_2)_B, P(E_3)_A, P(E_3)_B, P(E_4)_A, and P(E_4)_B for E_1 to E_4 of the target character string "/すぐ/行く/!/", the learning data update unit 60 calculates an entropy value I_n for each E_n (n = 1, 2, 3, 4) using the entropy formula of Reference 1, specifically, equation (5) below. The value of I_n is an index indicating the degree of ambiguity.
(Reference 1)
M. Li and I. K. Sethi, "Confidence-Based Active Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1251-1261, 2006

Equation (5) (Figure 0005566704)
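Equation (5) appears only as an image in this text, so its exact form is not reproduced here. As a rough sketch, assuming it is the standard two-outcome Shannon entropy, I_n = -P(E_n)_A log P(E_n)_A - P(E_n)_B log P(E_n)_B, the calculation could look like the following. The base-2 logarithm and the function name `entropy` are assumptions; a different base only rescales I_n and the threshold.

```python
import math

def entropy(p_boundary, p_no_boundary):
    """Two-outcome Shannon entropy in bits (assumed form of equation (5)).

    A term with probability 0 contributes 0, following the usual convention.
    """
    total = 0.0
    for p in (p_boundary, p_no_boundary):
        if p > 0.0:
            total -= p * math.log2(p)
    return total

most_ambiguous = entropy(0.5, 0.5)   # the two outcomes are equally likely
least_ambiguous = entropy(1.0, 0.0)  # the outcome is certain
```

Under this assumption I_n is largest when the two probabilities are equal and 0 when one of them is 1, which matches the description of I_n as a measure of ambiguity.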

The learning data update unit 60 compares I_n with a predetermined threshold C_1. Then, based on the magnitude relationship between I_n and the predetermined threshold C_1, the learning data update unit 60 determines whether or not a word boundary exists between the character strings corresponding to E_n.

Specifically, when I_n < C_1, the learning data update unit 60 determines that a word boundary exists between the character strings corresponding to E_n. For example, when I_2 < C_1, it determines that a word boundary exists between the second character strings (すぐ/行く). This is because, when the value of I_n is small, the reliability (certainty) of E_n according to the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94 is high. When I_n ≥ C_1, the learning data update unit 60 determines that no word boundary exists between the character strings corresponding to E_n. This is because, when the value of I_n is large, the reliability of E_n according to the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94 is not high.

As described above, the learning data update unit 60 determines the word boundaries in the target character string based only on the reliability scores of the joint scores between the character strings included in the target character string.

In the above description, the learning data update unit 60 acquires P(E_1)_A, P(E_1)_B, P(E_2)_A, P(E_2)_B, P(E_3)_A, P(E_3)_B, P(E_4)_A, and P(E_4)_B for E_1 to E_4. However, when the learning data update unit 60 acquires only P(E_1)_A, P(E_2)_A, P(E_3)_A, and P(E_4)_A for E_1 to E_4, without acquiring P(E_1)_B, P(E_2)_B, P(E_3)_B, and P(E_4)_B, it calculates the entropy value I_n by equation (6) below.

Equation (6) (Figure 0005566704)
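Equation (6) is likewise available only as an image. Since A_X and B_X count the cases with and without a boundary among the same Z_X cases, the two probabilities sum to 1, so one plausible reading (an assumption, not confirmed by the text) is that equation (6) substitutes 1 - P(E_n)_A for P(E_n)_B in the entropy. A sketch under that assumption, with a hypothetical function name and base-2 logarithm:

```python
import math

def entropy_from_boundary_probability(p_boundary):
    """Entropy in bits from P(E_n)_A alone, taking P(E_n)_B = 1 - P(E_n)_A."""
    p_no_boundary = 1.0 - p_boundary
    return -sum(p * math.log2(p)
                for p in (p_boundary, p_no_boundary)
                if p > 0.0)

i_half = entropy_from_boundary_probability(0.5)
i_quarter = entropy_from_boundary_probability(0.25)
```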

(Determining word boundaries based on the joint scores and the reliability scores)
The following description takes as an example the case where the learning data update unit 60 acquires, for the target character string "すぐ行く!", the reliability score (A_G1'/Z_G1' and B_G1'/Z_G1') of the joint score E_1' between the first characters ((no character)/す), the reliability score (A_G2'/Z_G2' and B_G2'/Z_G2') of the joint score E_2' between the second characters (す/ぐ), the reliability score (A_G3'/Z_G3' and B_G3'/Z_G3') of the joint score E_3' between the third characters (ぐ/行), the reliability score (A_G4'/Z_G4' and B_G4'/Z_G4') of the joint score E_4' between the fourth characters (行/く), the reliability score (A_G5'/Z_G5' and B_G5'/Z_G5') of the joint score E_5' between the fifth characters (く/!), and the reliability score (A_G6'/Z_G6' and B_G6'/Z_G6') of the joint score E_6' between the sixth characters (!/(no character)).
For convenience of explanation, of the reliability score of the joint score E_1', the probability that a word boundary exists (A_G1'/Z_G1' in the above example) is written as the reliability score P(E_1')_A, and the probability that no word boundary exists (B_G1'/Z_G1' in the above example) is written as the reliability score P(E_1')_B. Likewise, of the reliability score of the joint score E_2', the probability that a word boundary exists (A_G2'/Z_G2' in the above example) is written as P(E_2')_A, and the probability that no word boundary exists (B_G2'/Z_G2' in the above example) is written as P(E_2')_B. The same notation is used for the joint scores E_3' to E_6'.

Having acquired P(E_1')_A, P(E_1')_B, P(E_2')_A, P(E_2')_B, P(E_3')_A, P(E_3')_B, P(E_4')_A, P(E_4')_B, P(E_5')_A, P(E_5')_B, P(E_6')_A and P(E_6')_B for the joint scores E_1 to E_6 of the target character string 「すぐ行く!」, the learning data update unit 60 calculates an entropy value I_n for each E_n (n = 1, 2, 3, 4, 5, 6) using equation (5) above.
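The entropy step can be illustrated with a short sketch. Equation (5) is not reproduced in this passage, so the code below assumes it is the ordinary two-outcome Shannon entropy of P(E_n)_A and P(E_n)_B (in bits); the function name and the sample probabilities are hypothetical.

```python
import math


def binary_entropy(p_boundary: float, p_no_boundary: float) -> float:
    """Shannon entropy in bits of the pair (P(E_n)_A, P(E_n)_B).

    Assumption: equation (5) of the source is the standard binary entropy.
    """
    total = 0.0
    for p in (p_boundary, p_no_boundary):
        if p > 0.0:  # the term 0 * log(0) is treated as 0
            total -= p * math.log2(p)
    return total


# Hypothetical reliability scores for the six gaps of the target string.
reliability = [(0.9, 0.1), (0.05, 0.95), (0.8, 0.2),
               (0.1, 0.9), (0.5, 0.5), (1.0, 0.0)]
entropies = [binary_entropy(a, b) for a, b in reliability]
```

Under this assumption a low I_n means the stored probabilities strongly favour one outcome, and a value near 1 means they say little either way.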

The learning data update unit 60 compares each joint score E_n with a predetermined threshold C_2, and each entropy value I_n with a predetermined threshold C_1. Based on these two magnitude relationships, the learning data update unit 60 determines whether a word boundary exists between the character strings.

Specifically, when E_n < C_2 and I_n < C_1, the learning data update unit 60 determines that a word boundary exists between the character strings associated with E_n. For example, when E_2 < C_2 and I_2 < C_1, it determines that a word boundary exists at the second gap (す/ぐ). A small E_n together with a small I_n means high reliability (certainty) that "a word boundary is established between the two character strings in many cases". When E_n ≥ C_2 and I_n < C_1, the learning data update unit 60 determines that no word boundary exists between the character strings associated with E_n, because a large E_n together with a small I_n means high reliability that "a word boundary is not established between the two character strings in many cases". When I_n ≥ C_1, the learning data update unit 60 determines that no word boundary exists between the character strings indicated by E_n, because a large I_n means that the value of E_n itself is not reliable, regardless of what that value is.
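The three cases above reduce to a small decision function. This is a minimal sketch of the rule as stated in the text; the function and parameter names are illustrative, and the threshold values in the usage lines are made up.

```python
def has_word_boundary(e_n: float, i_n: float, c1: float, c2: float) -> bool:
    """Apply the rule described above to one gap.

    e_n: joint score of the gap, i_n: its entropy value,
    c1: entropy threshold C_1, c2: joint score threshold C_2.
    """
    if i_n >= c1:
        # The joint score is not reliable enough: treat as "no boundary".
        return False
    # Reliable score: a small joint score means a boundary exists.
    return e_n < c2


# Hypothetical thresholds and scores, for illustration only.
C1, C2 = 0.6, 2.0
decisions = [has_word_boundary(e, i, C1, C2)
             for e, i in [(0.5, 0.2), (3.0, 0.2), (0.5, 0.9)]]
```

Note that the rule is asymmetric: an unreliable score never produces a boundary, so uncertain gaps default to keeping the neighbouring strings joined.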

As described above, the learning data update unit 60 determines the word boundaries in the target character string based on the joint score of each gap between the character strings contained in the target character string and on the reliability score of each joint score.

Having determined whether each word boundary in the target character string exists, the learning data update unit 60 updates the word data without part-of-speech labels (the learning data) stored in the labeled data storage unit 90, based on the word boundary determination result.

Specifically, the labeled data storage unit 90 is provided with a first area, which stores the word data without part-of-speech labels entered by the user, and a second area, which stores the character strings (word data without part-of-speech labels) obtained by dividing the target character string at the boundaries given by the determination result of the learning data update unit 60. The learning data update unit 60 adds each character string obtained by dividing the target character string at the determined boundaries to the second area. It is preferable that the joint score calculation unit 10 ultimately calculates the joint scores using the data (learning data) stored in the second area of the labeled data storage unit 90. This improves the accuracy of the word data without part-of-speech labels stored in the labeled data storage unit 94, and in turn the accuracy of the part-of-speech estimation performed by a part-of-speech estimation device (not shown).

Alternatively, the second area need not be provided in the labeled data storage unit 90. In that case, the learning data update unit 60 may add each character string (word data without part-of-speech labels) obtained by dividing the target character string at the determined boundaries if it is not yet stored in the labeled data storage unit 90, and may delete each character string obtained by dividing the target character string at boundaries that differ from the determined ones if it is stored in the labeled data storage unit 90.
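The add-and-delete variant can be sketched as follows. The text does not say how the strings split "at different boundaries" are enumerated, so this sketch assumes the caller supplies one earlier segmentation to compare against; the set-based store and all names are hypothetical.

```python
def split_at(text: str, boundaries) -> list:
    """Split text at the given character offsets (0 < offset < len(text))."""
    cuts = [0] + sorted(boundaries) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def update_store(store: set, text: str, judged, previous) -> set:
    """Add words from the judged segmentation, drop words that only the
    earlier (differing) segmentation produced.

    Assumption: `previous` is a single earlier segmentation of `text`.
    """
    accepted = set(split_at(text, judged))
    rejected = set(split_at(text, previous)) - accepted
    store -= rejected   # delete entries split at differing boundaries
    store |= accepted   # add entries split at the judged boundaries
    return store


store = {"すぐ行", "く!"}
update_store(store, "すぐ行く!", judged=[2], previous=[3])
```

Using a set makes the "add only if not already stored" condition implicit, since adding an existing entry is a no-op.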

As described above, the learning data update unit 60 decides, based on the word boundary determination result, whether to reflect each word in the determination-target character string in the learning data, and updates the learning data word by word. Alternatively, the learning data update unit 60 may decide, based on the word boundary determination result, whether to reflect the determination-target character string as a whole in the learning data, and update the learning data character string by character string.

For example, when the learning data update unit 60 has determined word boundaries based only on the reliability scores (that is, when the extraction unit 40 extracted a character string with word boundary information), it calculates the average value I_AVE of the entropy values I_n within one target character string. The learning data update unit 60 then compares I_AVE with a predetermined threshold C_3 and, when I_AVE < C_3, reflects the entire target character string in the learning data. A small I_AVE means that the reliability (certainty) of E_n, judged by the word boundary information that the extraction unit 40 extracted from the labeled data storage unit 94, is high on average over the whole target character string. When I_AVE ≥ C_3, the learning data update unit 60 does not reflect the entire target character string in the learning data, because a large I_AVE means that this reliability is not high on average over the whole target character string.

As another example, when the learning data update unit 60 has determined word boundaries based on both the joint scores and the reliability scores (that is, when the extraction unit 40 extracted a character string without word boundary information), it calculates the average value E_AVE of the joint scores E_n and the average value I_AVE of the entropy values I_n within one target character string. The learning data update unit 60 may then compare E_AVE with a predetermined threshold C_4 and I_AVE with the predetermined threshold C_3 to decide whether the learning data needs to be updated. As one example, the learning data update unit 60 may reflect the entire target character string in the learning data when E_AVE ≥ C_4 and I_AVE < C_3. A large E_AVE together with a small I_AVE means high reliability (certainty) that the entire target character string forms a single unit.
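The two whole-string criteria can be written as one helper. This is a sketch under the reading given in the text: with word boundary information only I_AVE is checked against C_3, and without it E_AVE is additionally checked against C_4. The function name and argument layout are illustrative.

```python
from statistics import mean


def reflect_whole_string(entropies, c3, joint_scores=None, c4=None) -> bool:
    """Decide whether the whole target string is reflected in the learning data.

    entropies: the I_n values of one target string; c3: threshold C_3.
    joint_scores / c4: the E_n values and threshold C_4, given only when the
    string was extracted without word boundary information.
    """
    i_ave = mean(entropies)
    if joint_scores is None:
        # Boundary information available: the entropy average alone decides.
        return i_ave < c3
    # No boundary information: also require a high average joint score.
    return mean(joint_scores) >= c4 and i_ave < c3


# Hypothetical values, for illustration only.
with_info = reflect_whole_string([0.1, 0.2, 0.3], c3=0.5)
without_info = reflect_whole_string([0.1, 0.2], 0.5, joint_scores=[3.0, 5.0], c4=3.5)
```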

Next, the operation of the word boundary determination device 1 will be described. FIG. 5(a) is a flowchart showing an example of the operation up to the point where the word boundary existence probabilities are stored in the word boundary existence probability storage unit 96. FIG. 5(b) is a flowchart showing an example of the operation up to the point where the word data without part-of-speech labels is stored in the labeled data storage unit 94. FIG. 5(c) is a flowchart showing an example of the operation up to the point where the labeled data storage unit 90 is updated.

In FIG. 5(a), the joint score calculation unit 10 calculates joint scores using the sentence data (learning data) stored in the labeled data storage unit 90 (step S10). The joint score calculation unit 10 then outputs the joint scores to the joint score storage unit 92. The word boundary existence probability calculation unit 30 refers to the joint scores stored in the joint score storage unit 92 and calculates a word boundary existence probability for each joint score or for each joint score group (step S20). The word boundary existence probability calculation unit 30 then stores the calculated word boundary existence probabilities in the word boundary existence probability storage unit 96. The flowchart shown in FIG. 5(a) then ends.
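Step S20 can be pictured with the sketch below. It assumes, based on the A_G/Z_G and B_G/Z_G notation used for the reliability scores, that A_G and B_G count the observed gaps in group G with and without a word boundary and that Z_G is their total. Grouping by fixed-width score ranges, the `width` parameter and the sample observations are all hypothetical.

```python
import math
from collections import defaultdict


def boundary_probabilities(observations, width: float = 1.0) -> dict:
    """Return {group: (A_G / Z_G, B_G / Z_G)} for each joint score group.

    observations: iterable of (joint_score, is_boundary) pairs taken from
    labeled data. Groups are fixed-width score ranges (an assumption).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [A_G, B_G]
    for score, is_boundary in observations:
        group = math.floor(score / width)
        counts[group][0 if is_boundary else 1] += 1
    return {g: (a / (a + b), b / (a + b)) for g, (a, b) in counts.items()}


# Hypothetical labeled gaps: (joint score, boundary present?)
obs = [(0.2, True), (0.7, True), (0.9, False), (2.5, False)]
probs = boundary_probabilities(obs)
```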

In FIG. 5(b), the word boundary estimation unit 20 estimates the word boundaries of an unknown character string from the unknown character string and the joint scores stored in the joint score storage unit 92, and extracts the words obtained by dividing the unknown character string at those boundaries (step S110). The word boundary estimation unit 20 then stores each extracted word in the labeled data storage unit 94 as word data without part-of-speech labels (step S120). The flowchart shown in FIG. 5(b) then ends.

In FIG. 5(c), the extraction unit 40 extracts a target character string from the labeled data storage unit 94 (step S210) and outputs it to the reliability score assigning unit 50. Having acquired the target character string, the reliability score assigning unit 50 assigns a reliability score to the joint score of each gap between the character strings contained in the target character string (step S220). Specifically, as the reliability score of each such joint score, the reliability score assigning unit 50 assigns the word boundary existence probability that is stored in the word boundary existence probability storage unit 96 and corresponds to that joint score. The reliability score assigning unit 50 then outputs the reliability score of each joint score to the learning data update unit 60.
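Step S220 amounts to a table lookup. The sketch below assumes the stored probabilities are keyed by fixed-width joint score groups, as one possible layout; the fallback of (0.5, 0.5) for a group with no stored entry is purely an assumption of this sketch, since the text does not cover that case.

```python
import math


def assign_reliability(joint_scores, table: dict, width: float = 1.0,
                       default=(0.5, 0.5)) -> list:
    """Return the (P(E_n)_A, P(E_n)_B) pair for each gap's joint score.

    table: {group: (p_boundary, p_no_boundary)}, standing in for the word
    boundary existence probability storage unit 96 (hypothetical layout).
    default: used when no probability is stored for a group (assumption).
    """
    return [table.get(math.floor(s / width), default) for s in joint_scores]


# Hypothetical stored probabilities and joint scores.
table = {0: (0.8, 0.2), 2: (0.1, 0.9)}
scores = assign_reliability([0.4, 2.9, 7.0], table)
```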

Having acquired the reliability score of each joint score, the learning data update unit 60 determines the word boundaries in the target character string (step S230). Having determined whether each word boundary in the target character string exists, the learning data update unit 60 updates the word data without part-of-speech labels (the learning data) stored in the labeled data storage unit 90, based on the word boundary determination result (step S240). The flowchart shown in FIG. 5(c) then ends.

As described above, the word boundary determination device 1 according to the embodiment of the present invention determines word boundaries based on the reliability of the joint scores, and can therefore determine the word boundaries in a character string with high accuracy.

In addition, because the learning data (the word data without part-of-speech labels stored in the labeled data storage unit 90) is updated based on the word boundary determination result, the reliability of the joint score values stored in the joint score storage unit 92 improves, the reliability of the word boundary estimation by the word boundary estimation unit 20 improves, and the reliability of the word data without part-of-speech labels stored in the labeled data storage unit 94 improves. In other words, the accuracy of estimating the word boundaries of unknown words that arise during morphological analysis improves. As a result, the part of speech to be assigned to an unknown word can be estimated with high accuracy.

In general, in a word boundary estimation method based on supervised learning that uses labeled data as learning data, data whose labels have already been determined is learned (taken in) recursively when semi-supervised learning is performed. If incorrect labels have been attached to the data, however, the incorrectly labeled data is also learned recursively, so the accuracy of word boundary estimation degrades. When the word boundary determination device 1 according to the above embodiment is applied to such a word boundary estimation method, the reliability of each joint score is evaluated using its reliability score, so only data with highly reliable (probable) labels is learned recursively, and the loss of word boundary estimation accuracy caused by recursive learning can be kept to a minimum.

In other words, because using only learning data entered manually by a user is inefficient, label-determined data has conventionally been learned recursively for the sake of efficiency, but recursive learning lowered the accuracy of word boundary estimation. The word boundary determination device 1 addresses this problem: the loss of accuracy is suppressed even when label-determined data is learned recursively, so word boundary estimation can be performed both efficiently and with high accuracy.

The various processes of the word boundary determination device 1 described above may also be performed by recording a program for executing each process of the word boundary determination device 1 according to an embodiment of the present invention on a computer-readable recording medium, and having a computer system read and execute the program recorded on that medium. The "computer system" here may include an OS and hardware such as peripheral devices. If a WWW system is used, the "computer system" also includes the homepage providing environment (or display environment). The "computer-readable recording medium" refers to a flexible disk, a magneto-optical disk, a ROM, a writable nonvolatile memory such as a flash memory, a portable medium such as a CD-ROM, or a storage device such as a hard disk built into a computer system.

The "computer-readable recording medium" further includes media that hold a program for a certain period of time, such as the volatile memory (for example, DRAM (Dynamic Random Access Memory)) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line. The program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by a transmission wave in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. The program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and includes designs and the like that do not depart from the gist of the present invention.

DESCRIPTION OF SYMBOLS
1 Word boundary determination device
10 Joint score calculation unit
20 Word boundary estimation unit
30 Word boundary existence probability calculation unit
40 Extraction unit
50 Reliability score assigning unit
60 Learning data update unit
90 Labeled data storage unit (manual)
92 Joint score storage unit
94 Labeled data storage unit (machine)
96 Word boundary existence probability storage unit

Claims (3)

A word boundary determination device comprising:
a word boundary existence probability storage unit that stores, for each joint score indicating the degree of joining between character strings, or for each joint score group classified according to the range of the joint scores, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between the character strings;
a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and
a determination unit that determines word boundaries in a character string subject to word boundary determination, using the reliability scores of the joint scores between the character strings contained in that character string,
wherein the reliability score assigning unit
assigns, as the reliability score of the joint score between a pair of character strings, the word boundary existence probability that is stored in the word boundary existence probability storage unit and corresponds to the joint score between those character strings, and
wherein the determination unit,
based on the reliability scores of the joint scores between the character strings contained in a determination-target character string that has word boundary information, and where, among the reliability scores of the joint score (E_n) between the n-th character strings, the probability that a word boundary exists is a reliability score P(E_n)_A and the probability that no word boundary exists is a reliability score P(E_n)_B, calculates an entropy value (I_n) for each joint score (E_n) using the reliability score P(E_n)_A and the reliability score P(E_n)_B, and
determines whether a word boundary exists in the character string based on the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1).
A word boundary determination device comprising:
a word boundary existence probability storage unit that stores, for each joint score indicating the degree of joining between character strings, or for each joint score group classified according to the range of the joint scores, a word boundary existence probability indicating the probability that a word boundary exists or does not exist between the character strings;
a reliability score assigning unit that assigns a reliability score indicating the reliability of the joint score; and
a determination unit that determines word boundaries in a character string subject to word boundary determination, using the reliability scores of the joint scores between the character strings contained in that character string,
wherein the reliability score assigning unit
assigns, as the reliability score of the joint score between a pair of character strings, the word boundary existence probability that is stored in the word boundary existence probability storage unit and corresponds to the joint score between those character strings, and
wherein the determination unit,
based on the joint scores between the characters contained in a determination-target character string that has no word boundary information and on the reliability scores of those joint scores, and where, among the reliability scores of the joint score (E_n) between the n-th character strings, the probability that a word boundary exists is a reliability score P(E_n)_A and the probability that no word boundary exists is a reliability score P(E_n)_B, calculates an entropy value (I_n) for each joint score (E_n) using the reliability score P(E_n)_A and the reliability score P(E_n)_B, and
determines whether a word boundary exists in the character string based on the magnitude relationship between each calculated entropy value (I_n) and a predetermined threshold (C_1) and on the magnitude relationship between the joint score (E_n) and a predetermined threshold (C_2).
The word boundary determination device according to claim 1 or claim 2, further comprising a joint score calculation unit that calculates the joint score between a first character string and a second character string,
wherein the joint score calculation unit
counts a first number of occurrences, in which the second character string appears following the first character string in a sentence; a second number of occurrences, in which a character string other than the second character string appears following the first character string in a sentence; a third number of occurrences, in which the second character string appears following a character string other than the first character string in a sentence; and a fourth number of occurrences, in which a character string other than the second character string appears following a character string other than the first character string in a sentence; and calculates the joint score between the first character string and the second character string based on the first, second, third and fourth numbers of occurrences.
JP2010006049A 2010-01-14 2010-01-14 Word boundary judgment device Active JP5566704B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010006049A JP5566704B2 (en) 2010-01-14 2010-01-14 Word boundary judgment device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010006049A JP5566704B2 (en) 2010-01-14 2010-01-14 Word boundary judgment device

Publications (2)

Publication Number Publication Date
JP2011145885A JP2011145885A (en) 2011-07-28
JP5566704B2 true JP5566704B2 (en) 2014-08-06

Family

ID=44460680

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010006049A Active JP5566704B2 (en) 2010-01-14 2010-01-14 Word boundary judgment device

Country Status (1)

Country Link
JP (1) JP5566704B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102647657B1 (en) * 2021-02-25 2024-03-15 고려대학교 산학협력단 Method and apparatus for screening literature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5286125B2 (en) * 2009-03-24 2013-09-11 Kddi株式会社 Word boundary determination device and morphological analysis device

Also Published As

Publication number Publication date
JP2011145885A (en) 2011-07-28

Similar Documents

Publication Publication Date Title
CN107797984B (en) Intelligent interaction method, equipment and storage medium
CN107729322B (en) Word segmentation method and device and sentence vector generation model establishment method and device
US9679558B2 (en) Language modeling for conversational understanding domains using semantic web resources
US9524291B2 (en) Visual display of semantic information
JP5831951B2 (en) Dialog system, redundant message elimination method, and redundant message elimination program
EP3819785A1 (en) Feature word determining method, apparatus, and server
JP2020522044A5 (en)
CN109783631B (en) Community question-answer data verification method and device, computer equipment and storage medium
JP2019536119A (en) User interest identification method, apparatus, and computer-readable storage medium
CN104156349B (en) Unlisted word discovery and Words partition system and method based on statistics dictionary model
KR20170053527A (en) Apparatus and method for evaluating machine translation quality using distributed representation, machine translation apparatus, and apparatus for constructing distributed representation model
US10795878B2 (en) System and method for identifying answer key problems in a natural language question and answering system
CN109376222A (en) Question and answer matching degree calculation method, question and answer automatic matching method and device
CN107690634A (en) Automatic query pattern generation
US8407047B2 (en) Guidance information display device, guidance information display method and recording medium
CN112560452A (en) Method and system for automatically generating error correction corpus
US9886498B2 (en) Title standardization
JP5286125B2 (en) Word boundary determination device and morphological analysis device
US9256597B2 (en) System, method and computer program for correcting machine translation information
CN113705207A (en) Grammar error recognition method and device
JP5566704B2 (en) Word boundary judgment device
US9104755B2 (en) Ontology enhancement method and system
JP2019204415A (en) Wording generation method, wording device and program
KR100890404B1 (en) Method and Apparatus for auto translation using Speech Recognition
JP6097707B2 (en) Data updating apparatus, method, and program

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20120907

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20120910

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20130819

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20130903

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20140401

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20140522

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20140523

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20140610

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20140618

R150 Certificate of patent or registration of utility model

Ref document number: 5566704

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

S533 Written request for registration of change of name

Free format text: JAPANESE INTERMEDIATE CODE: R313533

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350