JP3143906B2

JP3143906B2 - Device for determining the presence of unknown words

Info

Publication number: JP3143906B2
Application number: JP02038916A
Authority: JP
Inventors: 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1990-02-19
Filing date: 1990-02-19
Publication date: 2001-03-07
Anticipated expiration: 2016-03-07
Also published as: JPH03240865A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Industrial Application Field]

The present invention relates to a device for determining the presence of unknown words. It is used in systems that must process character strings, such as word processors, machine translation systems, and database query systems, to determine whether the character string being processed contains an unknown word.

[Conventional technology]

A conventional device for determining the presence of unknown words compares the character string of a sentence against the character strings registered in a dictionary and tries various ways of dividing the sentence. If every possible division leaves some portion that matches no registered string, the device determines that an unknown word is present. Here, an unknown word means a character string that is not registered in the dictionary.

The following is an example of a case that a conventional device can handle.

Suppose the dictionary contains 「京都」 (Kyoto, read kyouto) and 「東京」 (Tokyo, read toukyou), but not 「東」 (higashi, "east"), 「都」 (miyako, "capital"), or 「東京都」 (Tokyo Metropolis). The character string 「東京都」 can then be divided as 「東京」「都」 or as 「東」「京都」. The strings 「東」 and 「都」 are not registered, so each division contains a portion that matches no dictionary entry. It is therefore determined that the string 「東京都」 contains an unknown word. In this case the unknown word is one of 「東」, 「都」, or 「東京都」.

The following is an example of a case that a conventional device cannot handle.

Suppose the dictionary contains the strings 「東」 (higashi), 「京都」 (kyouto), 「東京」 (toukyou), and 「都」 (miyako), but not 「東京都」 (toukyouto). A conventional device then treats the unknown word 「東京都」 not as an unknown word but as 「東」「京都」 (higashi kyouto) or 「東京」「都」 (toukyou miyako), and cannot tell that the unknown word 「東京都」 is present.
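The two conventional-device examples above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `can_segment` and `conventional_has_unknown` are hypothetical, and the dynamic-programming check is one possible way to test whether some division matches the dictionary everywhere, not the method described in the patent.

```python
def can_segment(text, dictionary):
    """Return True if text can be divided entirely into dictionary entries."""
    # reachable[i] is True when text[:i] can be fully divided into entries.
    reachable = [False] * (len(text) + 1)
    reachable[0] = True
    for end in range(1, len(text) + 1):
        for start in range(end):
            if reachable[start] and text[start:end] in dictionary:
                reachable[end] = True
                break
    return reachable[len(text)]


def conventional_has_unknown(text, dictionary):
    """Conventional rule: an unknown word is reported only if no division fits."""
    return not can_segment(text, dictionary)


# Case the conventional device handles: 「東」 and 「都」 are not registered.
small_dictionary = {"京都", "東京"}
print(conventional_has_unknown("東京都", small_dictionary))  # True

# Case it misses: every piece is registered, so 「東京都」 goes unnoticed.
large_dictionary = {"東", "京都", "東京", "都"}
print(conventional_has_unknown("東京都", large_dictionary))  # False
```

The second call shows the weakness: the larger the dictionary, the more likely some division matches everywhere, so the unknown word is not reported.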

[Problems to be solved by the invention]

The conventional device described above can detect an unknown word only when it finds a portion that matches no character string registered in the dictionary. Consequently, when no such portion exists, the device cannot determine that an unknown word is present even if one is. In particular, the more character strings are registered in the dictionary, the harder it becomes to detect the presence of an unknown word.

[Means for solving the problem]

The device for determining the presence of an unknown word according to the present invention comprises the following.

(A) Code generation means for generating a first unknown word determination code. This code is expressed using the number of characters constituting a first character string, which is input in order to determine whether an unknown word is present, the number of kanji contained in the first character string, and the number of consecutive kanji. It is assigned to the first character string and used when determining whether the unknown word is present in the first character string.

(B) Morphological analysis means that divides the first character string to generate a plurality of second character strings and collates each second character string in the resulting character string group with third character strings registered in advance. For each second character string that matches a third character string, the means obtains the part of speech assigned in advance to that third character string and gives the second character string the score assigned to that part of speech, the score being predetermined according to a predetermined evaluation of the part of speech as an independent word. If any second character string matches no third character string, the means divides the first character string again in a way different from the earlier divisions, regenerates the second character strings, and collates them with the third character strings, repeating this each time an unmatched second character string remains. If an unmatched second character string still remains after the first character string has been divided in every possible way, the means regards the first character string as containing an unknown word and stops the dividing operation. If a division leaves no unmatched second character string, the means outputs as the analysis result the plurality of character string groups obtained by all the different divisions performed up to that point. In doing so, it adds up all the scores given to the second character strings of each character string group and outputs the sum, as the score of that character string group, together with the character string groups.

(C) A dictionary in which the third character strings are registered together with the part of speech corresponding to each third character string and the score assigned to that part of speech.

(D) Analysis number counting means that, among the character string groups output from the morphological analysis means as the analysis result, counts the character string groups that have the smallest score and outputs the count as a first numerical value.

(E) An unknown word determination code table in which are registered second unknown word determination codes and a second numerical value assigned to each of them. The second unknown word determination codes are created in advance from combinations of predetermined numbers of characters, numbers of kanji, and numbers of consecutive kanji, and are referred to when determining whether an unknown word is present in the input first character string. The second numerical value is the criterion for that determination: when it is less than or equal to the first numerical value, it can be determined that an unknown word is present in the first character string.

(F) Comparison determination means that looks up, in the unknown word determination code table, the second unknown word determination code matching the first unknown word determination code assigned to the first character string by the code generation means, obtains the second numerical value assigned to that code, and compares it with the first numerical value output from the analysis number counting means. If the second numerical value is less than or equal to the first numerical value, the means determines that an unknown word is present in the first character string. If the second numerical value is greater than the first numerical value, it determines that no unknown word is present.

[Example]

Next, an embodiment of the present invention is described with reference to the drawings.

FIG. 1 is a configuration diagram of one embodiment of the present invention.

As shown in FIG. 1, the device for determining the presence of an unknown word according to the present invention comprises a morphological analyzer 1, a dictionary 2, an analysis number counting device 3, an unknown word determination table 4, and a comparison device 5.

FIG. 2 shows the flow of operation.

FIG. 3 is a diagram outlining the processing of the morphological analyzer 1, and FIG. 4 shows an example of the dictionary used in that processing.

Morphological analysis means determining which words are present in a character string. For example, given 「はなしがすみました」 (hanashi ga sumimashita), it determines the words of the sentence to be 「話が済むますた」, that is, the sequence 話, が, 済む, ます, た, and judges that 「話ガス見ますた」 and 「話が炭増すた」 are not correct. A device that performs such morphological analysis is called a morphological analyzer.

The outputs in the above example, 「話が済むますた」, 「話ガス見ますた」, and 「話が炭増すた」, are collectively called analysis results. Among the analysis results, the correct answer and the group expected to be correct are called analysis candidates. The number of analysis results is called the number of analyses, and the number of analysis candidates is called the number of candidates. A dictionary is a list of character strings together with their grammatical attributes.

The morphological analyzer 1 operates as follows.

First, it divides the input character string. Next, it collates the resulting pieces with the character strings registered in the dictionary. If any piece matches no registered string, it divides the input again in a different way. If every possible division leaves an unmatched piece, it determines that an unknown word is present and stops. If a division leaves no unmatched piece, it retrieves from the dictionary the data for each matching piece, adds up the scores contained in that data, and outputs the divided character string together with the total score.
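The operation of the morphological analyzer 1 can be sketched roughly as below. This is a minimal illustration, not the device's actual implementation: the function name `analyze`, the recursive enumeration, and the score of 10 per entry are all assumptions made for the example, since the real scores appear only in the figures.

```python
def analyze(text, dictionary):
    """Return every full division of text together with its total score.

    dictionary maps each registered string to its score (hypothetical values).
    An empty result means every division leaves an unmatched piece, which the
    analyzer treats as "an unknown word is present".
    """
    if not text:
        return [([], 0)]
    results = []
    for cut in range(1, len(text) + 1):
        head = text[:cut]
        if head in dictionary:
            # Divide the remainder in every possible way and add up the scores.
            for words, score in analyze(text[cut:], dictionary):
                results.append(([head] + words, dictionary[head] + score))
    return results


# Hypothetical scores: every entry is given 10 points for illustration only.
scores = {"東": 10, "京都": 10, "東京": 10, "都": 10}
for words, total in analyze("東京都", scores):
    print(words, total)
```

With these assumed scores the sketch yields two divisions, 「東」「京都」 and 「東京」「都」, each with a total of 20. Removing 「東」 and 「都」 from the dictionary yields an empty list.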

The dictionary 2 consists of character strings and their data. The data includes a score, and entries whose part of speech is an independent word are given higher scores.

FIG. 5 outlines this format.

The unknown word determination table 4 consists of search codes a and numerical values b. A search code is created from properties of the string under analysis, such as its number of characters. The numerical value b is assigned so as to be proportional to the number of characters. FIG. 6 outlines this format.

The determination device assigns scores during morphological analysis and selects the analyses with the lowest score. If many analyses are selected, it determines that the analyzed sentence contains an unknown word. If few are selected, it determines that no unknown word is present.

In general, the number of characters in a single word of a natural language used by humans falls within a certain range. The number of words in a language, viewed as a function of word length, follows a distribution with the property that the more characters a word has, the fewer such words there are. This property is not limited to Japanese but is common to other languages. Almost no words exceed 100 characters, whereas words of about three characters exist in large numbers. When an unknown word is present and can be broken down into smaller units, the number of analysis candidates made of short units increases, and so does the number of analyses of the sentence. For example, if the word 「水銀柱」 (mercury column) is unknown, it can be divided as 「水」「銀」「柱」, as 「水銀」「柱」, or as 「水」「銀柱」, so the number of analyses increases. The number of words and the number of analyses can be predicted from the number of characters in a sentence, and if more analysis results are obtained than predicted, it can be determined that an unknown word is present.
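The 「水銀柱」 example can be checked with a short counting sketch. The function name `count_divisions` is hypothetical, and the dictionary contents are an assumption: the text implies that 「水」, 「銀」, 「柱」, 「水銀」, and 「銀柱」 are registered, but it does not list the dictionary explicitly.

```python
def count_divisions(text, dictionary):
    """Count the ways text can be divided entirely into dictionary entries."""
    # ways[i] is the number of full divisions of text[:i].
    ways = [0] * (len(text) + 1)
    ways[0] = 1
    for end in range(1, len(text) + 1):
        for start in range(end):
            if text[start:end] in dictionary:
                ways[end] += ways[start]
    return ways[len(text)]


# Assumed dictionary: the short pieces are registered, 「水銀柱」 itself is not.
pieces = {"水", "銀", "柱", "水銀", "銀柱"}
print(count_divisions("水銀柱", pieces))  # 3
```

The three divisions counted here are the ones named in the text: 「水」「銀」「柱」, 「水銀」「柱」, and 「水」「銀柱」.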

The unknown word determination device operates as follows.

First, it obtains the input character string. It derives the code a from the number of characters of the input string, and then performs morphological analysis to obtain the divisions of the input string and their scores. The analyses with the lowest score are taken as the first candidates, and the number of first candidates is the numerical value c. Next, it looks up the unknown word determination table with the code a to obtain the numerical value b, and compares b with c. If b is less than or equal to c, it determines that an unknown word is present. If b is greater than c, it determines that no unknown word is present.
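The comparison step can be sketched as follows. The function names are hypothetical. The code (6, 4, 3), the value b = 2, and the lowest score of 32 are taken from the worked example in this description, while the analysis labels and the score of 45 are invented placeholders, because the actual analysis results appear only in the drawings.

```python
def count_first_candidates(analyses):
    """Count the analyses that share the lowest total score (numerical value c)."""
    lowest = min(score for _, score in analyses)
    return sum(1 for _, score in analyses if score == lowest)


def unknown_word_present(code_a, analyses, table):
    """Return True when b <= c, i.e. when an unknown word is judged present."""
    b = table[code_a]                      # numerical value b from the table
    c = count_first_candidates(analyses)   # numerical value c from the analyses
    return b <= c


# Table entry from the worked example: code (6 4 3) maps to b = 2.
TABLE = {(6, 4, 3): 2}

# Placeholder analyses: labels and the 45-point score are hypothetical.
not_registered = [("analysis 1", 32), ("analysis 2", 32), ("analysis 3", 45)]
registered = [("analysis 1", 32), ("analysis 2", 45)]

print(unknown_word_present((6, 4, 3), not_registered, TABLE))  # True
print(unknown_word_present((6, 4, 3), registered, TABLE))      # False
```

Two lowest-score analyses give c = 2, so b <= c and an unknown word is reported. A single lowest-score analysis gives c = 1, so b > c and none is reported.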

Next, the above processing flow is described using an example.

When 「東京都」 is an unknown word (not registered in the dictionary), processing proceeds as follows.

First, the morphological analyzer 1 obtains the input character string 「東京都に住む」 ("live in Tokyo"). Next, the code a is derived from the character counts of the input string as follows.

Number of characters: 6
Number of kanji: 4
Number of consecutive kanji: 3
Code a = (6 4 3)

Next, morphological analysis is performed to obtain the divisions of the input string and their scores. FIG. 7 shows an example of the dictionary used at this time. An example of the results obtained by this morphological analysis is shown below.

Next, the analysis number counting device 3 takes the analyses with the lowest score among the above as the first candidates. The first candidates are those with a score of 32 points, so the result is as follows.

Next, the number of first candidates is taken as the numerical value c. Since there are two first candidates, c = 2. The comparison device 5 then looks up the unknown word determination table with the code a to obtain the numerical value b. FIG. 8 shows an example of the unknown word determination table used at this time. Since the code a is (6 4 3), b = 2. The values b and c are then compared. Since b = 2 and c = 2, b = c. If b is less than or equal to c, an unknown word is determined to be present. If b is greater than c, no unknown word is determined to be present. Here b = c, so it is determined that an unknown word is present.
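The code a used in this example can be reproduced with a small sketch. The names `is_kanji` and `make_code` are hypothetical. Two points are assumptions: kanji are detected by the CJK Unified Ideographs range U+4E00 to U+9FFF, and "number of consecutive kanji" is read as the length of the longest run of kanji, which matches the value 3 given for this input.

```python
def is_kanji(ch):
    """Assumption: treat the CJK Unified Ideographs block as kanji."""
    return "\u4e00" <= ch <= "\u9fff"


def make_code(text):
    """Return (number of characters, number of kanji, longest run of kanji)."""
    total = len(text)
    kanji = sum(1 for ch in text if is_kanji(ch))
    longest = run = 0
    for ch in text:
        # Extend the current run on a kanji, reset it on any other character.
        run = run + 1 if is_kanji(ch) else 0
        longest = max(longest, run)
    return (total, kanji, longest)


print(make_code("東京都に住む"))  # (6, 4, 3)
```

The resulting tuple corresponds to the code a = (6 4 3) that is used to look up the unknown word determination table.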

When 「東京都」 is not an unknown word (it is registered in the dictionary), processing proceeds as follows.

First, the input character string 「東京都に住む」 is obtained.

Next, the code a is derived from the character counts of the input string as follows.

Number of characters: 6
Number of kanji: 4
Number of consecutive kanji: 3 1
Code a = (6 4 3)

Next, morphological analysis is performed to obtain the divisions of the input string and their scores. An example of the results obtained by this morphological analysis is shown below.

Next, the analysis with the lowest score among the above is taken as the first candidate. The first candidate is the one with a score of 32 points, so the result is as follows.

Next, the number of first candidates is taken as the numerical value c. Since there is one first candidate, c = 1. The unknown word determination table is then looked up with the code a to obtain the numerical value b. Since the code a is (6 4 3), b = 2. The values b and c are then compared. Since b = 2 and c = 1, b > c. Because b is greater than c, it is determined that no unknown word is present.

[Effects of the invention]

As described above, the present invention exploits the phenomenon that the number of analyses of a sentence increases when the sentence contains an unknown word. This makes it possible to determine the presence of unknown words that conventional devices could not detect, which in turn reduces errors when character strings are processed by machine.

In addition, when a character string is processed by machine, informing the user that an unknown word is present makes it easier to correct the division of the string and to find character strings that should be registered in the dictionary.

[Brief description of the drawings]

FIG. 1 is a configuration diagram of one embodiment of the present invention. FIG. 2 is a diagram showing the operation of the device for determining the presence of an unknown word. FIG. 3 is a diagram outlining the processing of the morphological analyzer 1. FIG. 4 is a diagram showing an example of the dictionary used by the morphological analyzer 1. FIG. 5 is a diagram showing the format of the dictionary. FIG. 6 is a diagram showing the format of the unknown word determination table. FIG. 7 is a diagram showing an example of the dictionary used for morphological analysis. FIG. 8 is a diagram showing an example of the unknown word determination table.

1: morphological analyzer; 2: dictionary; 3: analysis number counting device; 4: unknown word determination table; 5: comparison device.

Claims

(57) [Claims]

A device for determining the presence of an unknown word, comprising:

(A) code generation means for generating a first unknown word determination code that is expressed using the number of characters constituting a first character string input in order to determine whether an unknown word is present, the number of kanji contained in the first character string, and the number of consecutive kanji, and that is assigned to the first character string and used when determining whether the unknown word is present in the first character string;

(B) morphological analysis means for dividing the first character string to generate a plurality of second character strings; collating each second character string in the character string group composed of the plurality of second character strings with third character strings registered in advance; for each second character string that matches a third character string, obtaining the part of speech assigned in advance to that third character string and giving the second character string the score assigned to that part of speech, the score being predetermined according to a predetermined evaluation of the part of speech as an independent word; when any second character string matches no third character string, dividing the first character string again in a way different from the earlier divisions, regenerating the plurality of second character strings, and collating them with the third character strings, and repeating this each time such an unmatched second character string remains; when an unmatched second character string still remains after the first character string has been divided in every possible way, regarding the first character string as containing an unknown word and stopping the dividing operation; and, when a division leaves no unmatched second character string, outputting as an analysis result the plurality of character string groups obtained by all the different divisions performed up to that point, while adding up all the scores given to the second character strings of each character string group and outputting the sum, as the score of that character string group, together with the plurality of character string groups;

(C) a dictionary in which the third character strings are registered together with the part of speech corresponding to each third character string and the score assigned to that part of speech;

(D) analysis number counting means for counting, among the plurality of character string groups output from the morphological analysis means as the analysis result, the character string groups that have the smallest score, and outputting the count as a first numerical value;

(E) an unknown word determination code table in which are registered second unknown word determination codes, created in advance from combinations of predetermined numbers of characters, numbers of kanji, and numbers of consecutive kanji and referred to when determining whether an unknown word is present in the input first character string, and a second numerical value assigned to each second unknown word determination code, the second numerical value being the criterion for the determination such that, when it is less than or equal to the first numerical value, it can be determined that an unknown word is present in the first character string; and

(F) comparison determination means for looking up, in the unknown word determination code table, the second unknown word determination code that matches the first unknown word determination code assigned to the first character string by the code generation means, obtaining the second numerical value assigned to that second unknown word determination code, comparing the first numerical value output from the analysis number counting means with the second numerical value obtained from the unknown word determination code table, determining that an unknown word is present in the first character string when the second numerical value is less than or equal to the first numerical value, and determining that no unknown word is present in the first character string when the second numerical value is greater than the first numerical value.