JPH05135096A - Morpheme analyzing system - Google Patents

Morpheme analyzing system

Info

Publication number
JPH05135096A
Authority
JP
Japan
Prior art keywords
character
characters
index
word
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP3297517A
Other languages
Japanese (ja)
Other versions
JP3109187B2 (en)
Inventor
Yoshimichi Okuno
義道 奥野
Jiyousuke Hiraoka
丈介 平岡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Original Assignee
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meidensha Corp, Meidensha Electric Manufacturing Co Ltd
Priority to JP03297517A
Publication of JPH05135096A
Application granted
Publication of JP3109187B2
Anticipated expiration
Expired - Fee Related


Abstract

PURPOSE: To extract erroneous or missing characters in a sentence while enabling high-speed search and reducing memory capacity. CONSTITUTION: A character index is generated by sorting, character by character, all characters contained in the words of a word dictionary, with the storage positions of each character within the dictionary words held as an index list. For morphological analysis of an input text, the index list of each character in the input string is retrieved from the character index. When the storage positions in the index lists are consecutive, the string is judged to be one word. For a string that is not judged to be a word, it is judged that an erroneous character, a missing character, or an extra character exists at the discontinuous part.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a morphological analysis method for natural language processing.

[0002]

[Prior Art] Natural language processing has been applied to word processors, machine translation, and similar systems. Processing begins with morphological analysis, which divides the input text into words and assigns part-of-speech and semantic information to each word. Syntactic processing (parsing) follows, and semantic and contextual processing then remove the ambiguity and vagueness that remain after these stages.

[0003] Conventional morphological analysis methods include methods based on finite automata, methods based on associative memory, and methods using neural network techniques.

[0004] The finite automaton method generates an automaton from the state transition diagram for character matching and recognizes the words in a character string by software or wired logic.

[0005] The associative memory method gives the memory itself a search function and performs full-text search by comparing character strings cell by cell.

[0006] The neural method performs full-text search while learning on a neurocomputer.

[0007]

[Problems to Be Solved by the Invention] Among the conventional morphological analysis methods, the finite automaton method can handle missing characters in a string and fuzzy search using wildcards, but it has difficulty when the first character of the string is ambiguous. In addition, a hardware implementation cannot keep up with the varied search requests of application software, and a software implementation may be slower than the other methods.

[0008] The associative memory method requires a large-capacity memory body and is therefore costly. Using a mass storage device such as a disk as the memory body is technically very difficult, and at present the memory body is less than 1 MB. The method is also hard to couple with software, applications in particular, so searching complex sentences makes the application-side program complex as well.

[0009] The neurocomputer method is slow to search until training is complete, but after training it can search at high speed largely independently of the document volume, and it also supports fuzzy search. However, although searching flat text is fast, the method is hard to apply to a structured database such as an electronic dictionary.

[0010] An object of the present invention is to provide a morphological analysis method that can also extract erroneous and missing characters in a sentence while achieving high-speed search with a small memory capacity.

[0011]

[Means for Solving the Problems] To solve the above problems, the present invention comprises: a word dictionary that stores Japanese words as structures, each consisting of a written form and an attribute list; a character index whose structures each hold a character, obtained by sorting every character contained in the written forms of the word dictionary one character at a time, together with an index list of the positions at which that character is stored in the written forms; and a processing device that analyzes the morphemes of an input text. The processing device extracts, for each character of the input text, the matching character and its index list from the character index, and extracts a character string whose storage positions are consecutive as one word. When, in a character string for which this extraction has failed, the storage positions are discontinuous for one character or for a predetermined number of characters, the processing device determines that an erroneous character, a missing character, or an extra character exists at that one character or those characters.

[0012]

[Operation] According to the present invention, a character index is generated in advance for each character stored in the word dictionary. For each character of a string in the input text, the index list in the character index is used to determine whether the storage positions are consecutive, which decides whether the string is one word. When the string contains one character, or a predetermined number of characters, that breaks the continuity, that character or substring is judged to be an erroneous character or the like.

[0013]

[Embodiment] FIG. 1 is a block diagram showing an embodiment of the present invention. Word dictionary 1, organized as a file, stores structures each consisting of the written form of a word (hiragana, katakana, alphanumerics, single kanji, or compound words) and its attribute list. For example, at structure pointer K, as shown by structure 1k, the written form holds the compound word 「秋雨」 (akisame, "autumn rain"), and the attribute list holds "noun" as the part of speech and 「アキサメ」 as the reading.

[0014] Character index 2, organized as a file, is generated by processing device 3 using word dictionary 1 as its database, and stores one index list per character for all written characters in word dictionary 1. For example, at pointer H, as shown by structure 2n, the member character is 「秋」, and its index list stores the addresses of 「秋」 in the words that contain it. The index list thus gives the storage positions of the character 「秋」 within the words of word dictionary 1 whose written form includes it. For example, if word dictionary 1 stores the words 「秋雨」, 「中秋」, and 「秋月」, each contains 「秋」, so the addresses A1, A8, and A21 of 「秋」 in the respective words are generated and stored as storage position data.

[0015] Character index 2 is generated as follows. For every written character in word dictionary 1, its position (address) in the dictionary file is extracted. This information is collected per character and sorted, and the resulting pairs of a character and the index of its addresses are stored in the character index file. The storage layout uses, for example, a hash function that avoids collisions of storage positions, so that the index also provides a high-speed search environment.
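The generation procedure in [0015] can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `build_char_index`, the use of a Python `dict` as the hash-based store, and the one-address gap left between words (so that consecutive addresses never span two words) are all assumptions.

```python
from collections import defaultdict

def build_char_index(words):
    """Map each character to the sorted list of addresses where it is stored.

    Addresses count characters from the start of the (hypothetical)
    dictionary file. A one-address gap is left after every word; this is
    an assumption of the sketch, not something the source specifies.
    """
    index = defaultdict(list)
    address = 0
    for word in words:
        for ch in word:
            index[ch].append(address)
            address += 1
        address += 1  # assumed separator between dictionary words
    # A plain dict plays the role of the hash-based store.
    return {ch: sorted(addrs) for ch, addrs in index.items()}

index = build_char_index(["秋雨", "中秋", "秋月"])
print(index["秋"])  # [0, 4, 6]
```

With the three words from [0014], 「秋」 receives three addresses, one per word, which corresponds to the A1, A8, A21 example in the text (the concrete numbers differ because the sample dictionary here is tiny).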

[0016] After generating character index 2 from word dictionary 1, processing device 3 performs morphological analysis on the input text (sentences mixing kana and kanji) supplied through interface 4.

[0017] This processing follows the procedure shown in FIG. 2. First, the input text is morphologically analyzed by the longest-match method or a similar method (step S1). Character index 2 is used for matching against the dictionary: starting from the first character of the input text, each character of the string is looked up in character index 2 and its index list is retrieved in order. If the retrieved index lists contain a sequence of addresses in which every adjacent distance is 1, and that sequence is the longest one, the character string is judged to exist in word dictionary 1 and is entered in the morphological analysis list.

[0018] This processing is explained using the example in FIG. 3. Part (a) of the figure shows the case where the written form of the word 「東南アジア」 ("Southeast Asia") is stored in word dictionary 1 at addresses a through a+4. For this word, as shown in part (b), character index 2 holds the addresses a+2 and a+4, along with other addresses, in the index list for the character 「ア」; the address a+3, along with other addresses, in the index list for 「ジ」; the address a for 「東」; and the address a+1 for 「南」.

[0019] During morphological analysis, if the input text contains the string 「東南アジア」, the index lists for its characters are read from character index 2. From the addresses they contain (a+2, a+4, a+3, a, a+1), it is recognized that the distance between adjacent characters is 1 throughout. For example, the distance between 「東」 and 「南」 is (a+1) − a = 1. The string 「東南アジア」 can therefore be judged to exist in word dictionary 1 and is added to the morpheme list as one word.
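The continuity test of [0017] to [0019] can be illustrated with a short Python sketch. The names `build_char_index` and `longest_run` are hypothetical, and the sketch checks address continuity only, so it would also accept a substring of a dictionary word; the source does not say how step S1 handles word boundaries.

```python
from collections import defaultdict

def build_char_index(words):
    """Character -> set of addresses (assumed one-address gap between words)."""
    index = defaultdict(set)
    address = 0
    for word in words:
        for ch in word:
            index[ch].add(address)
            address += 1
        address += 1
    return index

def longest_run(text, start, index):
    """Length of the longest prefix of text[start:] whose characters sit at
    consecutive dictionary addresses (adjacent distance 1)."""
    current = set(index.get(text[start], ()))
    length = 0
    pos = start
    while current:
        length += 1
        pos += 1
        if pos >= len(text):
            break
        # Keep only address chains that continue with distance 1.
        current = {a + 1 for a in current} & index.get(text[pos], set())
    return length

index = build_char_index(["東南アジア", "アメリカ"])
print(longest_run("東南アジアの", 0, index))  # 5
print(longest_run("東軟アジア", 0, index))    # 1
```

The second call stops after 「東」 because 「軟」 has no address one step after any address of 「東」; such strings are the ones passed on to the error checks of step S3.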

[0020] Returning to FIG. 2, step S1 extracts the input text as a list of morphemes, one per word, but some characters may remain for which this analysis failed. The strings and characters that failed analysis are extracted as a single-kanji string list (step S2).

[0021] The extracted strings and characters are checked for erroneous, missing, and extra characters and corrected (step S3). An erroneous character is detected as follows. For each character of the target string, its index list is read from character index 2, and the address distances to the preceding character and to the following character are computed. When a distance other than 1 is found, that character is skipped and the address distance for the next character is computed. If the distance between the next character and the preceding character is 2, the skipped character is judged to be erroneous.

[0022] For example, suppose the string that failed morphological analysis is 「東軟アジア」. No pair of addresses in the index lists of 「東」 and 「軟」 is at distance 1. Skipping 「軟」 and checking the distance between the next character 「ア」 and the preceding character 「東」 gives 2, so 「軟」 is judged to be an erroneous character.
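The erroneous-character rule of [0021] and [0022] can be sketched as below. The index literal follows FIG. 3 with the base address a assumed to be 0; the helper names `has_distance` and `find_wrong_chars` are hypothetical, and the check is a simplified pairwise one that does not track a single address chain across the whole string.

```python
# Index lists for 「東南アジア」 as in FIG. 3, assuming base address a = 0.
INDEX = {"東": {0}, "南": {1}, "ア": {2, 4}, "ジ": {3}}

def has_distance(prev_ch, next_ch, distance, index=INDEX):
    """True if some address of next_ch lies `distance` after one of prev_ch."""
    later = index.get(next_ch, set())
    return any(a + distance in later for a in index.get(prev_ch, set()))

def find_wrong_chars(text, index=INDEX):
    """Positions i where text[i] breaks continuity, but skipping it leaves
    the neighbours at distance 2 (one dictionary character was replaced)."""
    wrong = []
    for i in range(1, len(text) - 1):
        prev_ch, ch, next_ch = text[i - 1], text[i], text[i + 1]
        if (not has_distance(prev_ch, ch, 1, index)
                and has_distance(prev_ch, next_ch, 2, index)):
            wrong.append(i)
    return wrong

print(find_wrong_chars("東軟アジア"))  # [1]
print(find_wrong_chars("東南アジア"))  # []
```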

[0023] A missing character is detected as follows. For each character of the target string, its index list is read from character index 2 and the address distance to the preceding character is computed. When a distance of 2 is found, it is judged that a character is missing between the two characters.

[0024] For example, suppose the string that failed morphological analysis is 「東南アア」. The address distance between the third character 「ア」 and the fourth character 「ア」 is 2, so it is judged that a character is missing between the two 「ア」 characters.
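The missing-character rule of [0023] and [0024] admits a similar sketch. As before, the index literal takes the FIG. 3 layout with base address a assumed to be 0, and `has_distance` and `find_missing_gaps` are hypothetical helper names for a simplified pairwise check.

```python
# Index lists for 「東南アジア」 as in FIG. 3, assuming base address a = 0.
INDEX = {"東": {0}, "南": {1}, "ア": {2, 4}, "ジ": {3}}

def has_distance(prev_ch, next_ch, distance, index=INDEX):
    """True if some address of next_ch lies `distance` after one of prev_ch."""
    later = index.get(next_ch, set())
    return any(a + distance in later for a in index.get(prev_ch, set()))

def find_missing_gaps(text, index=INDEX):
    """Positions i such that a character is missing between text[i-1] and
    text[i]: the pair is not adjacent in the dictionary but is 2 apart."""
    gaps = []
    for i in range(1, len(text)):
        prev_ch, ch = text[i - 1], text[i]
        if (not has_distance(prev_ch, ch, 1, index)
                and has_distance(prev_ch, ch, 2, index)):
            gaps.append(i)
    return gaps

print(find_missing_gaps("東南アア"))    # [3]
print(find_missing_gaps("東南アジア"))  # []
```

For 「東南アア」 the two 「ア」 characters sit at addresses 2 and 4, so the gap is reported before position 3, which is where 「ジ」 belongs.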

[0025] An extra character is detected as follows. The index list of each character of the target string is read, and when the address distance to the preceding character is not 1, that character is skipped and the distance between the preceding character and the next character is computed. If this distance is 1, the skipped character is judged to be an extra character.

[0026] For example, suppose the string that failed morphological analysis is 「東南軟アジア」. The distance between 「南」 and 「軟」 is not 1, so 「軟」 is skipped and the distance between 「南」 and 「ア」 is computed. This distance is 1, so 「軟」 is judged to be an extra character.
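The extra-character rule of [0025] and [0026] differs from the erroneous-character rule only in the distance expected after the skip (1 instead of 2). A minimal sketch, again with the FIG. 3 index at assumed base address 0 and hypothetical helper names:

```python
# Index lists for 「東南アジア」 as in FIG. 3, assuming base address a = 0.
INDEX = {"東": {0}, "南": {1}, "ア": {2, 4}, "ジ": {3}}

def has_distance(prev_ch, next_ch, distance, index=INDEX):
    """True if some address of next_ch lies `distance` after one of prev_ch."""
    later = index.get(next_ch, set())
    return any(a + distance in later for a in index.get(prev_ch, set()))

def find_extra_chars(text, index=INDEX):
    """Positions i where text[i] breaks continuity, but its neighbours are
    directly adjacent in the dictionary (text[i] was inserted)."""
    extra = []
    for i in range(1, len(text) - 1):
        prev_ch, ch, next_ch = text[i - 1], text[i], text[i + 1]
        if (not has_distance(prev_ch, ch, 1, index)
                and has_distance(prev_ch, next_ch, 1, index)):
            extra.append(i)
    return extra

print(find_extra_chars("東南軟アジア"))  # [2]
print(find_extra_chars("東南アジア"))    # []
```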

[0027] Returning again to FIG. 2, the strings in which erroneous, missing, or extra characters were detected and corrected in step S3 are returned to the morpheme list obtained by the analysis in step S1 and are output as a candidate list of correctly analyzed strings. The candidate list is then checked for connectability with the other strings, and morphological analysis ends (step S4). This connection check examines, for example, the part of speech of the preceding word.

[0028] As described above, this embodiment generates the character index from word dictionary 1 in advance, performs morphological analysis based on the continuity of the address distances in the target string, and detects erroneous, missing, and extra characters.

[0029] Compared with matching strings against the word dictionary, words containing a given character can therefore be looked up directly from the character index, which gives fast analysis. Checking for erroneous, missing, and extra characters also becomes easy, including the case where the first character of the text is ambiguous.

[0030] The memory need only be large enough to hold the character index, so a relatively small memory such as the computer's internal memory suffices, and the application-side program does not become more complex.

[0031] Furthermore, the method readily handles analysis of databases such as electronic dictionaries.

[0032] Although the embodiment shows detection of a single erroneous, missing, or extra character, the method can also be applied to detecting n (two or three) erroneous, missing, or extra characters.

[0033] For example, to detect n erroneous characters, when a character whose address distance is not 1 is found, n characters are skipped and the address distance is computed. If this distance is n+1, the n skipped characters are judged to be erroneous.

[0034] Similarly, to detect n extra characters, when a character whose distance is not 1 is found, n characters are skipped and the distance is computed. If this distance is 1, the n characters in between are judged to be extra. To detect n missing characters, when a character whose distance is not 1 is found and the distance is n+1, it is judged that characters are missing between the two characters.
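The n-character rules of [0033] and [0034] reduce to a small decision on two numbers: how many input characters were skipped and the resulting address distance. The function `classify` and its return labels below are hypothetical, a sketch of that decision rather than the patent's procedure.

```python
def classify(distance, skipped):
    """Classify a discontinuity between two anchor characters.

    distance: address distance between the anchors in the dictionary.
    skipped:  number of input characters skipped between the anchors.
    Returns (kind, n), where n is the number of affected characters.
    """
    if skipped == 0 and distance == 1:
        return ("ok", 0)                 # anchors are adjacent, no error
    if skipped == 0 and distance > 1:
        return ("missing", distance - 1)  # n missing: distance is n + 1
    if skipped > 0 and distance == skipped + 1:
        return ("wrong", skipped)         # n wrong: skip n, distance n + 1
    if skipped > 0 and distance == 1:
        return ("extra", skipped)         # n extra: skip n, distance 1
    return ("unknown", 0)

print(classify(2, 1))  # ('wrong', 1)
print(classify(1, 1))  # ('extra', 1)
print(classify(3, 0))  # ('missing', 2)
```

With n = 1 the three cases match the single-character examples 「東軟アジア」, 「東南軟アジア」, and 「東南アア」 given earlier.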

[0035] For the detection of n erroneous or missing characters described above, adding a check on the total number of characters makes it possible to detect erroneous or missing characters at the end of a word. To this end, as shown in FIG. 4, each entry in the character index list holds, in addition to the address data, the number of characters in the word containing the character and the position of the character within that word. The judgment then additionally checks whether the n-th character agrees with the character count and character position of the dictionary word.
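The extended index entry of [0035] might look like the following Python sketch. The `Entry` fields, the 1-based position convention, and the helper names are assumptions chosen for illustration; FIG. 4 itself is not reproduced in the text.

```python
from collections import defaultdict
from typing import NamedTuple

class Entry(NamedTuple):
    address: int   # storage position in the dictionary file
    word_len: int  # number of characters in the containing word
    pos: int       # 1-based position of the character within that word

def build_index(words):
    """Character -> list of Entry (assumed one-address gap between words)."""
    index = defaultdict(list)
    address = 0
    for word in words:
        for pos, ch in enumerate(word, start=1):
            index[ch].append(Entry(address, len(word), pos))
            address += 1
        address += 1
    return index

def is_word_end(entry):
    """True if the entry is the last character of its dictionary word."""
    return entry.pos == entry.word_len

index = build_index(["東南アジア"])
print(index["ジ"][0])               # Entry(address=3, word_len=5, pos=4)
print(is_word_end(index["ジ"][0]))  # False
print(is_word_end(index["ア"][1]))  # True
```

If an input string such as 「東南アジ」 ends on the entry for 「ジ」, `is_word_end` is false, which signals that one character is missing at the end of the word even though every adjacent distance was 1.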

[0036]

[Effects of the Invention] As described above, according to the present invention, morphological analysis and the detection of erroneous, missing, and extra characters are performed from the continuity of the word dictionary storage positions that the character index yields for the characters of the input text. Search is therefore faster than analysis that matches strings against the word dictionary, the character index needs only a small amount of memory, and erroneous, missing, and extra characters in the text can be verified.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is a flowchart of the analysis procedure of the embodiment.

FIG. 3 is a diagram illustrating morphological analysis in the embodiment.

FIG. 4 is a diagram illustrating another embodiment.

[Explanation of Reference Numerals]

1: word dictionary; 2: character index; 3: processing device.

Claims (1)

[Claims]
1. A morphological analysis method comprising: a word dictionary that stores Japanese words as structures, each consisting of a written form and an attribute list; a character index whose structures each hold a character, obtained by sorting every character contained in the written forms of the word dictionary one character at a time, together with an index list of the positions at which that character is stored in the written forms; and a processing device that analyzes the morphemes of an input text subject to morphological analysis, wherein the processing device extracts, for each character of the input text, the matching character and its index list from the character index, extracts a character string whose storage positions are consecutive as one word, and, when the storage positions in a character string for which this extraction has failed are discontinuous for one character or for a predetermined number of characters, determines that an erroneous character, a missing character, or an extra character exists at the one character or the predetermined number of characters.
JP03297517A 1991-11-14 1991-11-14 Morphological analysis method Expired - Fee Related JP3109187B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP03297517A JP3109187B2 (en) 1991-11-14 1991-11-14 Morphological analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP03297517A JP3109187B2 (en) 1991-11-14 1991-11-14 Morphological analysis method

Publications (2)

Publication Number Publication Date
JPH05135096A true JPH05135096A (en) 1993-06-01
JP3109187B2 JP3109187B2 (en) 2000-11-13

Family

ID=17847547

Family Applications (1)

Application Number Title Priority Date Filing Date
JP03297517A Expired - Fee Related JP3109187B2 (en) 1991-11-14 1991-11-14 Morphological analysis method

Country Status (1)

Country Link
JP (1) JP3109187B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019091174A (en) * 2017-11-13 2019-06-13 富士通株式会社 Information generation program, word extraction program, information processing apparatus, information generation method and word extraction method

Also Published As

Publication number Publication date
JP3109187B2 (en) 2000-11-13

Similar Documents

Publication Publication Date Title
US6424983B1 (en) Spelling and grammar checking system
US7092871B2 (en) Tokenizer for a natural language processing system
EP0971294A2 (en) Method and apparatus for automated search and retrieval processing
EP0378848A2 (en) Method for use of morphological information to cross reference keywords used for information retrieval
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JP2014238865A (en) Coreference resolution in ambiguity-sensitive natural language processing system
Zhang et al. Automated multiword expression prediction for grammar engineering
US6470334B1 (en) Document retrieval apparatus
US20050004790A1 (en) Processing noisy data and determining word similarity
JP4278011B2 (en) Document proofing apparatus and program storage medium
JP3109187B2 (en) Morphological analysis method
Kanada A method of geographical name extraction from Japanese text for thematic geographical search
JP4047895B2 (en) Document proofing apparatus and program storage medium
JP2792147B2 (en) Character processing method and device
JP4318223B2 (en) Document proofing apparatus and program storage medium
JP4047894B2 (en) Document proofing apparatus and program storage medium
KR20000039406A (en) Method for indexing compound noun with complement-predicate relation through part sentence structure analysis
Singh et al. Intelligent Bilingual Data Extraction and Rebuilding Using Data Mining for Big Data
JP2894736B2 (en) Sentence inspection method
JP2595043B2 (en) Automatic Japanese text error verification device
JP2592995B2 (en) Phrase extraction device
JP3884001B2 (en) Language analysis system and method
JPH10240736A (en) Morphemic analyzing device
JPH0248938B2 (en)
JPH0546612A (en) Sentence error detector

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees