JP3904025B2 - Character string dividing device and recording medium

Info

Publication number: JP3904025B2
Application number: JP2005290403A
Authority: JP
Inventors: 多田　　智之
Original assignee: Omron Corp
Current assignee: Omron Corp
Priority date: 2005-10-03
Filing date: 2005-10-03
Publication date: 2007-04-11
Anticipated expiration: 2018-03-11
Also published as: JP2006059377A

Description

The present invention relates to the analysis of text written in languages such as Japanese and Chinese, in which words are not separated by spaces.

Unlike Western languages such as English, languages such as Japanese and Chinese are written without spaces between words, so word boundaries cannot be read directly from the character string. Japanese text analysis therefore has to begin with a division process that separates a continuous character string into words and phrases. Various division methods have been proposed (see, for example, Patent Document 1).

One well-known division method is the character-type cutting method. It delimits words and phrases quickly by dividing the character string in advance at each transition between hiragana and another character type. However, dividing at every character-type transition risks over-division, that is, dividing at points that are not actually word boundaries. Methods proposed to prevent over-division include one that uses a so-called phrase dictionary. A phrase dictionary registers strings that contain a character-type transition that is not a phrase boundary. When the input character string is divided at character-type transitions, it is compared against the words in the phrase dictionary so that over-division is avoided.

Patent Document 1: Japanese Patent Laid-Open No. 06-096115

However, the words in the phrase dictionary are of variable length, so checking the input character string for a match against each word requires an enormous amount of computation, which lowers processing speed.

An object of the present invention is to solve the above problem by making it possible to determine whether a character string may be divided using only a simple table lookup.

An ordinary sentence is a sequence of words, so dividing an input sentence into words means dividing it at the word boundaries. When a sentence must be divided without knowing its meaning, it can be divided without error by not dividing wherever the character string at that point is contained in some word that might be used, and dividing only where the character string is not contained in any such word.

The present invention performs this processing efficiently. It is a character string dividing device comprising: division prohibition code storage means for storing a plurality of two-character strings, each in association with a code indicating whether the point between the two characters is to be a division point; a pointer that points to one character of an input character string; determination means for determining whether a hiragana character immediately follows the character indicated by the pointer; means for advancing the pointer, without making a division point between the character indicated by the pointer and the character immediately following it, when the determination means determines that a hiragana character follows; means for detecting the code by referring to the division prohibition code storage means with the character indicated by the pointer and the character immediately following it, when the determination means does not determine that a hiragana character follows; and means for dividing between the character indicated by the pointer and the character immediately following it in accordance with the detected code, and advancing the pointer.

The present invention is also a recording medium on which is recorded a program that causes a computer to function as: division prohibition code storage means for storing a plurality of two-character strings, each in association with a code indicating whether the point between the two characters is to be a division point; a pointer that points to one character of an input character string; determination means for determining whether a hiragana character immediately follows the character indicated by the pointer; means for advancing the pointer, without making a division point between the character indicated by the pointer and the character immediately following it, when the determination means determines that a hiragana character follows; means for detecting the code by referring to the division prohibition code storage means with the character indicated by the pointer and the character immediately following it, when the determination means does not determine that a hiragana character follows; and means for dividing between the character indicated by the pointer and the character immediately following it in accordance with the detected code, and advancing the pointer.

Even when the code retrieved for two characters extracted from a sentence indicates that they should not be divided, that point may in fact be a word boundary in that sentence. In other words, the division table gives a sufficient condition for dividing a sentence into words, not a necessary one. The final division is left to the morphological analysis stage, which uses a word dictionary and a connectability dictionary. Used as a pre-division that speeds up that analysis, the table finds reliable division points at high speed.

According to the present invention, word division that is coarse but reliable can be performed with simple processing.

The morphological analysis system of the present invention is described below with reference to the drawings. The system first performs a simple word division on the input character string using a division table, and then performs an accurate word division (morphological analysis) using a word dictionary on the pre-divided character string. This keeps the analysis accurate while shortening processing time.

First, the division table is described. FIG. 1 shows an outline of the division table. The division table is a two-dimensional array whose row and column addresses are the characters that can appear in an input sentence (digits, alphabetic characters, kana, and kanji). The cells enclosed in the thick frame in FIG. 1 show, for illustration, the character assigned to each address. In practice the address is a value calculated from the character code of that character (for example, its ordinal position within the character set). An ellipsis ("…") in the table indicates that part of the table is omitted.

Each cell specified by a row address and a column address holds one bit, with the following meaning.

0 (reset): for the two-character substring formed by the row-address character followed by the column-address character, the point between the two characters is certain to be a word boundary.

1 (set): for the two-character substring formed by the row-address character followed by the column-address character, the two characters may belong to the same word, so dividing between them may cause over-division.

A division table whose bits are set or reset according to the above rule is created as follows. A word is read from the word dictionary 5 used for morphological analysis (see FIG. 3), and each substring of two consecutive characters is extracted from it. For each substring, the bit in the cell whose row address is the first character and whose column address is the second character is set.

Take the word 「べた書き」 as an example. Three two-character substrings can be extracted from it: 「べた」, 「た書」, and 「書き」. Because these substrings occur inside the single word 「べた書き」, dividing the input sentence between their two characters when they appear could cut a word (for example, 「べた書き」) in the middle. Therefore, so that the sentence is not divided between the characters when 「べた」 is extracted from it, the bit in the cell with row address 「べ」 and column address 「た」 is set. Likewise, for 「た書」 and 「書き」, the bit in the cell with row address 「た」 and column address 「書」 and the bit in the cell with row address 「書」 and column address 「き」 are set. Performing this for every word registered in the word dictionary sets the bit for every two-character substring that can occur within a word. For any substring whose cell remains reset, it is guaranteed that dividing the sentence between the two characters never cuts a word in the middle.
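The table-building rule above can be sketched in a few lines of Python. This is a minimal sketch rather than the patent's implementation: it records the set bits as a Python set of character pairs instead of a two-dimensional bit array, and the name `build_division_table` is hypothetical.

```python
def build_division_table(words):
    """Collect every adjacent two-character substring found inside a word.

    A pair present in the returned set corresponds to a bit that is set (1):
    dividing between those two characters might cut a word in the middle.
    A pair that is absent corresponds to a reset bit (0).
    """
    table = set()
    for word in words:
        # zip(word, word[1:]) yields each pair of consecutive characters
        for first, second in zip(word, word[1:]):
            table.add((first, second))
    return table


table = build_division_table(["べた書き"])
print(("べ", "た") in table)  # pair taken from inside the word
print(("き", "べ") in table)  # pair that never occurs inside the word
```

A set gives the same yes/no answer as the bit table for any pair, at the cost of hashing instead of direct addressing.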

The row and column addresses of the division table may be the character codes themselves, or each character may be converted into a simpler code (such as a serial number) that is then used as the address.

The pre-division device 2 (see FIG. 3) that uses this division table extracts two characters at a time, in order from the beginning of the input sentence, and refers to the division table. It divides between the two characters (inserts a division mark) only when the bit in the cell they specify is reset, which makes the pre-division fast.

This pre-division is fast because it does not compare strings of differing lengths: the number of characters is fixed at two, and the only processing performed on them is a table lookup. Because the input sentence is already divided to some extent by this process, the subsequent morphological analysis need only analyze each block separately, which greatly reduces the processing load.

If the language to be analyzed is Japanese and the possible input characters are those of the character set defined by JIS X 0208, which has 8836 code positions (94 × 94), the division table is a two-dimensional table of 8836 × 8836 cells. Depending on the storage the table occupies and the speed of lookup, this two-dimensional array may be used as is, or a method such as a hash table may be used instead.
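For scale, 8836 × 8836 is 78,074,896 one-bit cells, which is 9,759,362 bytes (about 9.8 MB) when packed eight cells to a byte. The following Python sketch shows one way such a packed table could be addressed. It is an illustration under stated assumptions, not the patent's implementation: the names `jis_index` and `BitTable` are hypothetical, and the character-to-address mapping relies on Python's `euc_jp` codec, which encodes a JIS X 0208 character as two bytes each equal to 0xA0 plus the row or cell number.

```python
CHARSET_SIZE = 94 * 94  # JIS X 0208 is a 94 x 94 grid: 8836 code positions


def jis_index(ch):
    """Map a JIS X 0208 character to 0..8835, or None if it is outside the set."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    if len(raw) != 2 or raw[0] < 0xA1 or raw[1] < 0xA1:
        return None  # ASCII, half-width katakana, or JIS X 0212
    return (raw[0] - 0xA1) * 94 + (raw[1] - 0xA1)


class BitTable:
    """Full two-dimensional table packed at eight cells per byte.

    Callers are assumed to pass only characters for which jis_index
    returns a number.
    """

    def __init__(self):
        self.bits = bytearray((CHARSET_SIZE * CHARSET_SIZE + 7) // 8)

    def _cell(self, first, second):
        # Row address times row width plus column address
        return jis_index(first) * CHARSET_SIZE + jis_index(second)

    def set(self, first, second):
        cell = self._cell(first, second)
        self.bits[cell >> 3] |= 1 << (cell & 7)

    def get(self, first, second):
        cell = self._cell(first, second)
        return (self.bits[cell >> 3] >> (cell & 7)) & 1


table = BitTable()
table.set("書", "き")
print(len(table.bits), table.get("書", "き"), table.get("き", "書"))
```

The hash-table alternative mentioned above trades this fixed allocation for storage proportional to the number of set bits; in Python a plain `set` of character pairs would play that role.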

The table lookup may also be limited to frequently used characters, for example the level 1 characters. In that case, the pre-division device described below simply decides not to pre-divide whenever either of the two characters is some other character (level 2 and so on). Because such characters occur infrequently, leaving those points to the morphological analyzer without pre-division hardly reduces the overall processing speed.

Pre-division of an input sentence is described concretely using a simple example. Consider dividing the following two example sentences into words.

A: 埋め込んだシリコン層を熱酸化して (the embedded silicon layer is thermally oxidized)
B: 経時的なしきい値電圧の変動防止 (prevention of variation in the threshold voltage over time)

Here, as shown in FIG. 2, the division table is assumed to have bits set for the substrings extracted from the words 「しきい値」, 「しきい値電圧」, 「突き抜け」, 「埋め込」, 「埋込み金属」, 「シリコン層」, 「熱酸化」, 「経時的」, 「変動」, and 「防止」.

Pre-dividing the example sentences with reference to this division table gives:

A: 埋め込んだ｜シリコン層を｜熱酸化して
B: 経時的なしきい値電圧の｜変動｜防止

Here 「｜」 is the division mark. In sentence A, the bits for 「だシ」 and 「を熱」 are not set in the division table, so division is judged possible at those points and division marks are inserted. In sentence B, the bits for 「の変」 and 「動防」 are not set, so division is likewise judged possible there and division marks are inserted. No division mark is placed between 「経時的な」 and 「しきい値」 because the character-type cutting method is used as the basic division method, so hiragana characters are not divided from each other. Alternatively, the character-type cutting method may be dropped so that division can also occur between hiragana characters. In this way, a simple table lookup pre-divides the input character string with very high accuracy.
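The example can be checked mechanically. The Python sketch below lists, for each sentence, the adjacent pairs whose bit is reset in a table built from the ten words above, and then the pairs that remain as division points when no division is made before a hiragana character. It assumes the rule that a pair is never divided when its second character is hiragana, and it uses the Unicode Hiragana block as the hiragana test; the names `split_pairs` and `is_hiragana` are hypothetical.

```python
WORDS = ["しきい値", "しきい値電圧", "突き抜け", "埋め込", "埋込み金属",
         "シリコン層", "熱酸化", "経時的", "変動", "防止"]

# Pairs whose bit is set: every adjacent pair inside a dictionary word.
SET_BITS = {(a, b) for w in WORDS for a, b in zip(w, w[1:])}


def is_hiragana(ch):
    # Assumption: the Unicode Hiragana block stands in for the hiragana test.
    return "\u3041" <= ch <= "\u309f"


def split_pairs(sentence):
    """Return (pairs with a reset bit, those left after the hiragana rule)."""
    reset = [a + b for a, b in zip(sentence, sentence[1:])
             if (a, b) not in SET_BITS]
    kept = [pair for pair in reset if not is_hiragana(pair[1])]
    return reset, kept


for sentence in ["埋め込んだシリコン層を熱酸化して", "経時的なしきい値電圧の変動防止"]:
    reset, kept = split_pairs(sentence)
    print(sentence, reset, kept)
```

Several pairs such as 「込ん」 and 「層を」 have reset bits but end in hiragana, which is why the table lookup alone would divide more often than the result shown above.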

FIG. 3 is a functional block diagram of a sentence analysis apparatus that uses the division table. The character string 1 to be morphologically analyzed is in a language, such as Japanese, written without spaces between words. For Japanese, the target is a character string written as mixed kanji and kana text. The character string is first input to the pre-division device 2, which pre-divides it using the division table 3.

The pre-division device 2 takes two characters at a time, in order from the beginning of the input character string, and refers to the division table 3 to judge whether the character string can be divided between them. It then makes the final decision on whether to divide by applying the following rule to the result from the division table.

Rule: if the second character of the two-character substring is hiragana, do not divide even when the bit is reset, because over-division is possible. The reason is that most dependent words that follow a word, such as particles and auxiliary verbs, are written in hiragana, and it is preferable for the morphological analyzer 4 to analyze a dependent word together with the word that precedes it.

If this judgment finds that the string can be divided, a division mark is inserted between the two characters.

The pre-divided character string is then input to the morphological analyzer 4.

The morphological analyzer 4 has the word dictionary 5, which is also used to create the division table, and the connectability dictionary 6. For each block produced by the pre-division device 2, it performs morphological analysis to match words, reads attribute information such as the part of speech of each word, and finally decides the analysis result based on connectability across the whole sentence.

The morphological analysis algorithm is described with reference to the example in FIG. 4.

(1) Prepare a pointer that indicates a position in the input sentence (a point between characters). Initially the pointer is at position 0 (to the left of the first character). A virtual sentence-start node is also placed there.

(2) Look up in the dictionary the words that begin at the pointer position. In the example of FIG. 4, 「この」 (adnominal) and 「こ」 (suffix: 個) are found at position 0.

(3) For each pair of a word that ends at the pointer position (the sentence-start node at position 0) and a word that begins there, create a link between them if they can be connected. Discard any word beginning at the pointer position that cannot be connected to any of the words ending there. At position 0 in the example, 「この」 is linked and 「こ」 is discarded.

(4) Scan rightward from the pointer position and move the pointer to the next position at which some word ends. In the example, the pointer moves from position 0 to position 2 and then to position 3.

(5) Repeat steps (2), (3), and (4) until the pointer reaches the end of the sentence. At the end of the sentence, place a virtual sentence-end node, check whether each word ending there can connect to it, link only those that can, and finish.

(6) Each path (sequence of nodes and links) from the sentence-start node to the sentence-end node is then a morphological analysis result for the input sentence.
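Steps (1) to (6) can be sketched in Python as follows. This is a minimal illustration, not the analyzer of the embodiment: the function name `analyze`, the toy dictionary, the part-of-speech labels, and the connectability pairs are all hypothetical and are not taken from FIG. 4. Dictionary entries are assumed to be non-empty strings.

```python
def analyze(text, dictionary, can_connect):
    """Enumerate word sequences that cover text and satisfy connectability.

    dictionary: list of (surface, part_of_speech) entries.
    can_connect(left, right): True if the two parts of speech may be adjacent.
    """
    # Step 1: paths[p] holds every partial path whose last word ends at p.
    # Position 0 starts with the virtual sentence-start node.
    paths = {0: [[("", "BOS")]]}
    for pointer in range(len(text)):
        if pointer not in paths:
            continue  # Step 4: skip positions where no word ends
        for surface, pos in dictionary:
            if not text.startswith(surface, pointer):
                continue  # Step 2: only words that begin at the pointer
            for path in paths[pointer]:
                # Step 3: link connectable pairs; unconnectable words are dropped
                if can_connect(path[-1][1], pos):
                    end = pointer + len(surface)
                    paths.setdefault(end, []).append(path + [(surface, pos)])
    # Steps 5 and 6: keep paths that can connect to the sentence-end node
    return [[word for word, _ in path[1:]]
            for path in paths.get(len(text), [])
            if can_connect(path[-1][1], "EOS")]


# Hypothetical toy data for illustration only.
DICTIONARY = [("この", "adnominal"), ("こ", "suffix"),
              ("の", "particle"), ("本", "noun")]
ALLOWED = {("BOS", "adnominal"), ("BOS", "noun"), ("adnominal", "noun"),
           ("suffix", "particle"), ("particle", "noun"), ("noun", "EOS")}

print(analyze("この本", DICTIONARY, lambda left, right: (left, right) in ALLOWED))
```

With this toy data, 「こ」 is discarded at position 0 because a suffix cannot follow the sentence-start node, so position 1 is never visited and the pointer effectively jumps from 0 to 2. The sketch stores whole paths for clarity; a practical analyzer would store nodes and links and extract paths afterwards.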

This completes the basic morphological analysis, but as the figure makes clear, the results still include many inappropriate solutions. Some priority rule is therefore needed to select only the plausible ones.

A commonly used priority rule is to prefer the solution composed of the longest possible words, or the solution with the fewest words. There is no logical basis for this heuristic, but it is intuitively reasonable: as in the illustrated example, a large share of the invalid solutions contain unnecessarily short words.
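As a small illustration of the fewest-words rule, the Python sketch below picks one candidate from a list of segmentations. The function name `pick_preferred` and the candidate segmentations are hypothetical, and breaking ties by the longest single word is an assumption made here, not something the description specifies.

```python
def pick_preferred(candidates):
    """Prefer the fewest words; break ties by the longest single word."""
    return min(
        candidates,
        key=lambda words: (len(words), -max(len(w) for w in words)),
    )


# Hypothetical candidate segmentations of the same character string.
candidates = [
    ["し", "きい", "値", "電圧"],
    ["しきい値", "電圧"],
]
print(pick_preferred(candidates))
```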

The final morphological analysis result is decided by this priority rule and output to the next stage, for example a syntactic and semantic analyzer.

When the example sentences of FIG. 2 are morphologically analyzed, the above analysis is performed separately for each block, 「埋め込んだ」, 「シリコン層を」, 「熱酸化して」, 「経時的なしきい値電圧の」, 「変動」, and 「防止」, and only connectability is checked between consecutive blocks.

The sentence analysis apparatus described above is usually implemented with a personal computer and software. The hardware of a personal computer that performs the functions of the apparatus, and the operation of its processing unit, are described below.

FIG. 5 shows the configuration of a personal computer that performs the functions of the sentence analysis apparatus, together with the storage areas of the RAM 21a and the hard disk drive 22. In FIG. 5(A), the personal computer main body 20 consists of a main board 21 carrying a CPU and memory, and built-in peripherals such as a hard disk 22, a CD-ROM drive 23, a floppy disk drive 24, and a modem 25. A monitor 30, a keyboard 31, a mouse 32, and the like are connected to the main body 20. The application program that performs the functions of the sentence analysis apparatus is normally stored on the hard disk 22 and is read into the RAM on the main board 21 when it is executed. The application program (including the division table, the word dictionary, and the connectability dictionary) is supplied on a CD-ROM 33 set in the CD-ROM drive 23, a floppy disk 34 set in the floppy disk drive 24, or the like, and installed on the hard disk 22. The program may also be downloaded from a server computer 40 connected via the modem 25.

In FIG. 5(B), when the personal computer starts up, the system program 210 is read from the hard disk drive 22 into the RAM 21a. When the sentence analysis program (application program) is started, an application area 211 is reserved for its execution. The sentence analysis program 212 and the division table 213 are read into this area, and a work area 214 is reserved. The work area 214 holds, among other things, an area for the sentence being analyzed and cache areas for the word dictionary and the connectability dictionary on the hard disk 22.

In FIG. 5(C), the hard disk drive 22 stores the system program 220 and the sentence analysis program 221, which is an application program. The system program 220 is read into the RAM 21a when the personal computer starts up, and the sentence analysis program 221 is read into the RAM 21a when that program is started. The division table 222, the word dictionary 223, and the connectability dictionary 224 are stored together with the application program. The division table 222 is read into the RAM 21a along with the program when the sentence analysis program starts. The word dictionary 223 and the connectability dictionary 224 are large, so they are referenced while remaining on the hard disk 22, and only parts of them are read into the RAM 21a as a cache.

FIG. 6 is a flowchart of the pre-division operation of the personal computer. First, the pointer is placed at (the left of) the first character of the input character string (s1). In s2 it is judged whether the pointer has reached the end of the input character string; if so, the operation ends. Otherwise the steps from s3 onward are executed. The first pass therefore proceeds from s1 to s3.

In s3 it is judged whether the character following the one at the pointer is hiragana. If it is, no division point is made between the character at the pointer and the next character (s5), the pointer is advanced by one (s8), and processing returns to s2. This is because the process uses the character-type cutting method as its basic division method and applies the rule above (do not divide when the second character of the substring is hiragana), so it never divides immediately before a hiragana character.

If the character following the pointer is not hiragana, the division table is looked up with the substring formed by the character at the pointer and the next character, and the bit in that cell is read (s4). If the bit is 1, dividing could cause over-division, so no division is made here (s5) and the pointer is advanced by one (s8). If the bit is 0, the string can be divided at this position, so a mark is inserted as a division point between the character at the pointer and the next character (s7) and the pointer is advanced by one (s8). These steps are repeated until the pointer reaches the end of the sentence (s2).
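The FIG. 6 loop can be sketched in Python with the step labels as comments. This is a sketch under assumptions rather than the embodiment's program: the table is represented as a set of character pairs whose bit is 1, the hiragana test uses the Unicode Hiragana block, and the names `pre_divide` and `is_hiragana` are hypothetical.

```python
def is_hiragana(ch):
    # Assumption: the Unicode Hiragana block stands in for the hiragana test.
    return "\u3041" <= ch <= "\u309f"


def pre_divide(text, set_bits, mark="｜"):
    """Insert mark at every division point found by the FIG. 6 loop.

    set_bits: set of (first, second) pairs whose table bit is 1.
    """
    out = []
    pointer = 0                                  # s1: start at the first character
    while pointer < len(text) - 1:               # s2: stop at the last character
        current, following = text[pointer], text[pointer + 1]
        out.append(current)
        if is_hiragana(following):               # s3: hiragana follows
            pass                                 # s5: no division point
        elif (current, following) in set_bits:   # s4: bit is 1
            pass                                 # s5: possible over-division
        else:
            out.append(mark)                     # s7: bit is 0, insert the mark
        pointer += 1                             # s8: advance the pointer
    out.append(text[pointer:])                   # copy the final character
    return "".join(out)


# Table built from the ten words of the FIG. 2 example.
WORDS = ["しきい値", "しきい値電圧", "突き抜け", "埋め込", "埋込み金属",
         "シリコン層", "熱酸化", "経時的", "変動", "防止"]
SET_BITS = {(a, b) for w in WORDS for a, b in zip(w, w[1:])}

print(pre_divide("埋め込んだシリコン層を熱酸化して", SET_BITS))
print(pre_divide("経時的なしきい値電圧の変動防止", SET_BITS))
```

Each iteration does a constant amount of work, so the whole pass is linear in the length of the input regardless of how many words the dictionary holds.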

FIG. 7 is a flowchart of the processing that creates the division table. First, the area for the division table is allocated and the bits of all cells are initialized to 0 (s11). The first word is then read from the word dictionary (s12), and the pointer is placed at its first character (s14). The cell addressed by the character at the pointer and the next character is located and its bit is set to 1 (s17). The pointer is then moved one character to the right (s18). The steps from s15 onward are repeated until the pointer reaches the last character of the word. When it does, the next word is read from the dictionary (s16) and the steps from s13 onward are repeated. This continues until there are no more words to read (s13), at which point the division table is complete.
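The FIG. 7 procedure can be sketched in Python with an explicit pointer and a two-dimensional list of bits. The sketch makes two simplifying assumptions that are not part of the embodiment: characters are numbered in sorted order rather than by character code, and the table covers only the characters that occur in the given words. The name `create_table` is hypothetical.

```python
def create_table(words):
    """Build a square bit table over the characters that occur in words.

    Returns (index, table), where index maps a character to its row and
    column address and table[row][col] is the bit for that pair.
    """
    # Assumption: a compact serial number replaces the character code.
    index = {ch: i for i, ch in enumerate(sorted(set("".join(words))))}
    size = len(index)
    table = [[0] * size for _ in range(size)]    # s11: clear every bit
    for word in words:                           # s12, s16, s13: each word in turn
        pointer = 0                              # s14: first character of the word
        while pointer < len(word) - 1:           # s15: until the last character
            row = index[word[pointer]]
            col = index[word[pointer + 1]]
            table[row][col] = 1                  # s17: set the bit
            pointer += 1                         # s18: move one to the right
    return index, table


index, table = create_table(["べた書き"])
print(table[index["べ"]][index["た"]], table[index["た"]][index["べ"]])
```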

The division table may be created in advance, when the word dictionary is created, and supplied together with the sentence analysis program. Even where an equivalent division table can be produced by some other procedure, such as simple copying, a medium storing that division table and a sentence analysis apparatus or sentence analysis method using it also fall within the scope of the claims of this application.

Diagram showing an example of the division table used in this invention
Diagram showing a specific example of the division table and the pre-division processing
Functional block diagram of a sentence analysis device according to an embodiment of this invention
Diagram showing an example of morphological analysis
Block diagram of a personal computer on which the sentence analysis device is implemented
Flowchart showing the operation of the personal computer
Flowchart showing the division table creation operation

Explanation of symbols

1 ... input character string, 2 ... pre-dividing device, 3 ... division table,
4 ... morphological analyzer, 5 ... word dictionary, 6 ... concatenation possibility dictionary,
7 ... word information output,
21 ... main board, 21a ... RAM, 22 ... hard disk drive,
23 ... CD-ROM drive, 33 ... CD-ROM,
24 ... floppy disk drive, 34 ... floppy disk,
31 ... keyboard, 32 ... mouse

Claims

A character string dividing device comprising:
division prohibition code storage means for storing a plurality of two-character strings, each together with a code indicating whether or not a dividing point is to be set between its two characters;
a pointer that points to one character in an input character string;
determination means for determining whether or not a hiragana character immediately follows the character indicated by the pointer;
means for advancing the pointer without setting a dividing point between the character indicated by the pointer and the character immediately after it when the determination means determines that a hiragana character follows;
means for detecting the code by referring to the division prohibition code storage means with the character indicated by the pointer and the character immediately after it when the determination means does not determine that a hiragana character follows; and
means for dividing between the character indicated by the pointer and the character immediately after it in accordance with the detected code, and advancing the pointer.

A recording medium on which is recorded a program that causes a computer to function as:
division prohibition code storage means for storing a plurality of two-character strings, each together with a code indicating whether or not a dividing point is to be set between its two characters;
a pointer that points to one character in an input character string;
determination means for determining whether or not a hiragana character immediately follows the character indicated by the pointer;
means for advancing the pointer without setting a dividing point between the character indicated by the pointer and the character immediately after it when the determination means determines that a hiragana character follows;
means for detecting the code by referring to the division prohibition code storage means with the character indicated by the pointer and the character immediately after it when the determination means does not determine that a hiragana character follows; and
means for dividing between the character indicated by the pointer and the character immediately after it in accordance with the detected code, and advancing the pointer.