JPH0289157A - Japanese language morpheme analyzing system - Google Patents
Japanese language morpheme analyzing system
- Publication number
- JPH0289157A
- Authority
- JP
- Japan
- Prior art keywords
- word
- connection
- string
- degree
- grammatical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Description
[Detailed Description of the Invention]
Technical Field
The present invention relates to morphological analysis of Japanese text, and in particular to processing that divides text into morphemes (words) easily and accurately by using the two-bunsetsu longest-match method together with inter-word connection degrees. (A bunsetsu is a Japanese phrase unit.)
Prior Art
Commonly known techniques for morphologically analyzing Japanese text and extracting morphemes include the longest-match method, the two-bunsetsu longest-match method, and the minimum-bunsetsu-count method. They are used for kana-kanji conversion in word processors and similar devices. However, these methods have low accuracy when segmenting compound words made up of consecutive kanji. To compensate for this drawback, recent approaches apply special analysis to kanji compound words, using statistical methods or the semantic classification of words. The statistical methods require statistics to be collected and processed in advance for many kanji characters. To use word meanings, each word in the word dictionary must be given an appropriate semantic classification in advance.
In addition, within the methods above, a connection matrix has conventionally been used to decide whether two consecutive words can connect, or to choose one word from several candidate words that could follow a given preceding word.
Thus, to analyze Japanese text accurately with conventional approaches, an analysis method such as the longest-match method, the two-bunsetsu longest-match method, or the minimum-bunsetsu-count method must be combined with the special compound-word processing described above. The text cannot be analyzed by a single uniform process, and the cost is generally high.
Object
The present invention was made in view of the circumstances described above. Specifically, each element of the connection matrix table holds a connection degree weighted to express how easy or difficult the grammatical connection is. The connection degree is used not only to decide whether two consecutive words can connect, but also to select the bunsetsu candidate whose word sequence (word string) is appropriate. The object is to enable accurate morphological analysis through a single uniform process, even when segmenting compound words made up of consecutive kanji, and to reduce the cost of the processing as a whole.
Constitution
To achieve the above object, the present invention concerns processing that divides input Japanese text into words or morphemes by matching it against words in a previously prepared word dictionary, and comprises the following:
a morphology/connection-grammar property correspondence table that associates a word's morphological information (part of speech, conjugation form, and so on) with connection grammar information, namely a pre-connection grammatical property describing the word's grammatical behavior toward the word immediately before it and a post-connection grammatical property describing its behavior toward the word immediately after it;
a connection matrix giving the connection degree between pre-connection and post-connection grammatical properties;
dictionary search processing that obtains a word's morphological information by matching the input character string against word strings in the word dictionary;
correspondence table search processing that looks up the connection grammar information from the morphological information extracted by the dictionary search;
connection matrix search processing that looks up the connection degree between the post-connection grammatical property of the preceding word and the pre-connection grammatical property of the following word; and
connection determination processing that decides whether words can connect on the basis of the connection degree obtained from the connection matrix search.
The input Japanese character string is searched in the dictionary sequentially from its beginning to extract connectable word string candidates. When there are multiple candidates, a word string is selected by a determination that uses at least the cumulative connection degree of the candidate word string or its average connection degree per word.
More specifically, the invention further comprises word string candidate determination processing that extracts connectable word string candidates in the same way and, when there are multiple candidates, determines the word string by the two-bunsetsu longest-match criterion. When that processing does not determine the word string candidate uniquely, the word string is selected by a determination that uses the cumulative connection degree of the two-bunsetsu candidate word string or its average connection degree per word. The invention is described below on the basis of an embodiment.
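The tables named above can be pictured with a small Python sketch. It models a correspondence table and a connection matrix as dictionaries and computes the cumulative and per-word average connection degree of a word string. The part-of-speech labels, property names, and degree values are invented for illustration, and dividing the cumulative degree by the number of words is an assumed reading of "average connection degree per word"; the patent specifies none of these details.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    surface: str  # the word as written
    morph: str    # morphological information (here just a part-of-speech label)


# Hypothetical correspondence table:
# morphological info -> (pre-connection property, post-connection property)
CORRESPONDENCE = {
    "noun": ("N_pre", "N_post"),
    "particle": ("P_pre", "P_post"),
    "verb": ("V_pre", "V_post"),
}

# Hypothetical connection matrix:
# (post property of preceding word, pre property of following word) -> degree.
# Missing pairs are treated as degree 0, i.e. not connectable.
CONNECTION = {
    ("N_post", "N_pre"): 2,
    ("N_post", "P_pre"): 5,
    ("P_post", "N_pre"): 4,
    ("P_post", "V_pre"): 5,
}


def connection_degree(prev: Word, nxt: Word) -> int:
    """Look up the connection degree between two adjacent words."""
    post = CORRESPONDENCE[prev.morph][1]
    pre = CORRESPONDENCE[nxt.morph][0]
    return CONNECTION.get((post, pre), 0)


def cumulative_degree(words: List[Word]) -> int:
    """Sum of connection degrees over all adjacent word pairs."""
    return sum(connection_degree(a, b) for a, b in zip(words, words[1:]))


def average_degree(words: List[Word]) -> float:
    """Cumulative degree divided by the number of words (assumed definition)."""
    return cumulative_degree(words) / len(words) if words else 0.0
```

Keeping the matrix sparse, with absent pairs meaning degree 0, matches the later rule that degree 0 marks a connection as impossible.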
FIG. 1 is a block diagram illustrating one embodiment of the present invention. In the figure, 1 is a Japanese character string input unit, 2 is a search string setting unit, 3 is a word string candidate setting unit, 4 is a word string candidate selection unit, 5 is a word dictionary, 6 is a word dictionary search unit, 7 is a connection determination unit, 8 is a result output unit, 9 is a morphology/connection-grammar property correspondence table, 10 is a search unit for that correspondence table, 11 is a connection matrix table search unit, and 12 is a connection matrix table.
From the Japanese character string entered through the input unit 1, the search string setting unit 2 sets the character string needed for dictionary search and outputs it to the word dictionary search unit 6. The dictionary search unit 6 searches the word dictionary 5 for words that match from the beginning of the search string. For each word found, the correspondence table search unit 10 uses the word's morphological information to look up its connection grammar information in the correspondence table 9 and attaches it to the word. For each of these words, the connection matrix table search unit 11 searches the connection matrix table 12 using the word's pre-connection grammatical property and the post-connection grammatical property of the word preceding it, and obtains the connection degree between the two words. The connection determination unit 7 treats a connection degree of 0 as "cannot connect" and rejects such words. It passes the remaining words, with their connection degrees attached, to the word string candidate setting unit 3 as candidates. Once the word strings in the candidate setting unit 3 reach the prescribed selection criterion, and if multiple candidates remain, the word string candidate selection unit 4 selects an appropriate one as the word string candidate. In the two-bunsetsu longest-match method, for example, the selection criterion is that the word string is long enough to make up two bunsetsu.
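The flow through units 6, 11, 7, and 3 can be sketched roughly as follows. This is an illustrative Python sketch, not the patented implementation: the toy dictionary, the `toy_degree` function, and the use of a word-count limit in place of the two-bunsetsu criterion are all assumptions made to keep the example short.

```python
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, str]  # (surface, morphological label)


def prefix_matches(text: str, pos: int, dictionary: Dict[str, List[str]]) -> List[Entry]:
    """Dictionary search: every dictionary word that matches text starting at pos."""
    found = []
    for end in range(pos + 1, len(text) + 1):
        surface = text[pos:end]
        for morph in dictionary.get(surface, []):
            found.append((surface, morph))
    return found


def extend_candidates(
    text: str,
    dictionary: Dict[str, List[str]],
    degree: Callable[[str, str], int],
    max_words: int,
) -> List[Tuple[List[Entry], List[int]]]:
    """Enumerate connectable word strings from the start of text.

    A candidate is recorded when it cannot be extended further: end of text,
    the word limit, or no connectable dictionary word.
    """
    results: List[Tuple[List[Entry], List[int]]] = []

    def walk(pos: int, words: List[Entry], degrees: List[int]) -> None:
        nexts = prefix_matches(text, pos, dictionary) if len(words) < max_words else []
        extended = False
        for surface, morph in nexts:
            new_degrees = degrees
            if words:
                d = degree(words[-1][1], morph)
                if d == 0:
                    continue  # connection determination: degree 0 means "cannot connect"
                new_degrees = degrees + [d]
            extended = True
            walk(pos + len(surface), words + [(surface, morph)], new_degrees)
        if not extended and words:
            results.append((words, degrees))

    walk(0, [], [])
    return results


# Toy data (hypothetical values, for illustration only).
TEXT = "東京都に"
DICTIONARY = {
    "東": ["noun"], "東京": ["noun"], "京都": ["noun"],
    "都": ["noun"], "東京都": ["noun"], "に": ["particle"],
}
TOY_DEGREES = {("noun", "noun"): 2, ("noun", "particle"): 5, ("particle", "noun"): 4}


def toy_degree(prev_morph: str, next_morph: str) -> int:
    return TOY_DEGREES.get((prev_morph, next_morph), 0)
```

Calling `extend_candidates(TEXT, DICTIONARY, toy_degree, max_words=4)` on this toy data yields three segmentations of the same four characters, which is exactly the kind of ambiguity the candidate selection unit has to resolve.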
In the two-bunsetsu longest-match method, the first bunsetsu of the first-candidate word string chosen by the candidate selection unit is fixed as the analysis result, and word strings containing that same bunsetsu are set as the word string candidates. Analysis proceeds in this way, and when the entire input character string has been analyzed, the result is output to the result output unit 8.
Apart from the candidate selection unit, a known configuration such as that of the two-bunsetsu longest-match method can be used, provided that it can retain the result of the connection matrix search for each word.
Next, an example of the processing performed by the candidate selection unit is described with reference to the flowchart of FIG. 2. Multiple word string candidates are input to this processing, and, as shown in the figure, the candidates are narrowed down to one by a sequence of selection conditions.
First, the word string with the largest number of characters is chosen from the candidates. If several candidates remain, the number of words in each word string is compared and the word string with the fewest words is chosen. If there are still several candidates, the sum of all the connection degrees between the words in each word string is compared and the word string with the largest sum is selected. Here, the connection degree between words is the value obtained by the connection matrix search described above. A value of 0 means the words cannot connect, and a larger value means the words connect or combine more easily and the connection is more likely. If several candidates still remain, the frequencies of the words in each word string are summed and the word string with the largest sum is selected.
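The narrowing steps of FIG. 2 as described above can be sketched in Python as a cascade of filters. The `Candidate` type, the sample frequencies, and the choice to return the first remaining candidate when all four steps tie are assumptions for illustration; the text does not say what happens in a final tie.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    words: List[str]        # surfaces of the words in the word string
    degrees: List[int]      # connection degree of each adjacent word pair
    frequencies: List[int]  # frequency of each word (hypothetical values)


def keep_best(cands: List[Candidate], key: Callable[[Candidate], int]) -> List[Candidate]:
    """Keep only the candidates that reach the maximum of key."""
    best = max(key(c) for c in cands)
    return [c for c in cands if key(c) == best]


def select_candidate(cands: List[Candidate]) -> Candidate:
    """Apply the four selection conditions in order until one candidate remains."""
    steps: List[Callable[[Candidate], int]] = [
        lambda c: sum(len(w) for w in c.words),  # 1. most characters
        lambda c: -len(c.words),                 # 2. fewest words
        lambda c: sum(c.degrees),                # 3. largest cumulative connection degree
        lambda c: sum(c.frequencies),            # 4. largest cumulative frequency
    ]
    for key in steps:
        cands = keep_best(cands, key)
        if len(cands) == 1:
            break
    return cands[0]  # assumption: first remaining candidate wins a full tie
```

With two candidates of equal character count and equal word count, such as 東京/都/に and 東/京都/に, the third step decides, so the result depends directly on the weights stored in the connection matrix.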
The candidate selection unit may also make its selection by means of a judgment formula. In a simple case, for example, it selects the word string for which the value Y of the following formula is largest.
$$Y = w_1 X_1 + w_2 X_2 + w_3 X_3 + w_4 X_4$$
where
$X_1$: number of characters in the word string
$X_2$: number of words in the word string
$X_3$: cumulative connection degree of the word string
$X_4$: cumulative frequency of the word string
$w_1$ to $w_4$: weighting coefficients
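A minimal Python sketch of the judgment formula follows. The weight values are hypothetical, since the patent gives none, and the negative weight on the word count is an assumption chosen so that fewer words score higher, in line with the flowchart's preference.

```python
from typing import Sequence, Tuple

# (X1 characters, X2 words, X3 cumulative connection degree, X4 cumulative frequency)
Features = Tuple[int, int, int, int]

# Hypothetical weights w1..w4; binary-exact values keep the arithmetic exact.
WEIGHTS = (1.0, -1.0, 0.5, 0.25)


def judgment_value(x: Features, w: Sequence[float] = WEIGHTS) -> float:
    """Y = w1*X1 + w2*X2 + w3*X3 + w4*X4."""
    return sum(wi * xi for wi, xi in zip(w, x))


def select_by_judgment(cands: Sequence[Features], w: Sequence[float] = WEIGHTS) -> int:
    """Return the index of the candidate with the largest Y."""
    return max(range(len(cands)), key=lambda i: judgment_value(cands[i], w))
```

Unlike the step-by-step narrowing, a weighted sum lets a strong connection degree or frequency outweigh a larger word count, so the two approaches can choose different word strings for the same candidates.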
Effect
As is clear from the above description, according to the present invention, in processing that divides input Japanese text into words or morphemes by matching it against words in a previously prepared word dictionary, when there are multiple word string candidates, a word string is selected by a determination that uses the cumulative or average connection degree of the candidate word strings. As a result, even the segmentation of compound words made up of consecutive kanji can be analyzed accurately by a single uniform process, and the processing as a whole can be implemented at low cost.
Brief Description of the Drawings
FIG. 1 is a block diagram illustrating one embodiment of the present invention, and FIG. 2 is a flowchart illustrating an example of the processing of the candidate selection unit.
1: Japanese character string input unit; 2: search string setting unit; 3: word string candidate setting unit; 4: word string candidate selection unit; 5: word dictionary; 6: word dictionary search unit; 7: connection determination unit; 8: result output unit; 9: morphology/connection-grammar property correspondence table; 10: correspondence table search unit; 11: connection matrix table search unit; 12: connection matrix table.
Claims (1)
1. A Japanese morphological analysis method for processing that divides input Japanese text into words or morphemes by matching it against words in a previously prepared word dictionary, the method comprising: a morphology/connection-grammar property correspondence table that associates a word's morphological information with connection grammar information; a connection matrix giving the connection degree between pre-connection and post-connection grammatical properties; dictionary search processing that obtains a word's morphological information by matching the input character string against word strings in the word dictionary; correspondence table search processing that looks up connection grammar information from the morphological information extracted by the dictionary search; connection matrix search processing that looks up the connection degree between the post-connection grammatical property of the preceding word and the pre-connection grammatical property of the following word; and connection determination processing that decides whether words can connect on the basis of the connection degree obtained from the connection matrix search; wherein the input Japanese character string is searched in the dictionary sequentially from its beginning to extract connectable word string candidates, and, when there are multiple candidates, a word string is selected by a determination that uses at least the cumulative connection degree of the candidate word string or the average connection degree per word of the candidate word string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP63239956A JPH0289157A (en) | 1988-09-26 | 1988-09-26 | Japanese language morpheme analyzing system |
Publications (1)
Publication Number | Publication Date |
---|---|
JPH0289157A true JPH0289157A (en) | 1990-03-29 |
Family
ID=17052331
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP63239956A Pending JPH0289157A (en) | 1988-09-26 | 1988-09-26 | Japanese language morpheme analyzing system |
Country Status (1)
Country | Link |
---|---|
JP (1) | JPH0289157A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5087956A (en) * | 1985-10-25 | 1992-02-11 | Hitachi, Ltd. | Semiconductor memory device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS5617467A (en) * | 1979-07-20 | 1981-02-19 | Fujitsu Ltd | Word-to-word connection approval unit |
JPS5727368A (en) * | 1980-07-28 | 1982-02-13 | Fujitsu Ltd | "kana" (japanese syllabary) to "kanji" (chinese character converter) |
JPS6020234A (en) * | 1983-07-15 | 1985-02-01 | Fujitsu Ltd | Japanese language morpheme analysis system |
JPS61184682A (en) * | 1985-02-12 | 1986-08-18 | Ricoh Co Ltd | Kana/kanji conversion processor |
-
1988
- 1988-09-26 JP JP63239956A patent/JPH0289157A/en active Pending