JPH03263260A - Japanese language analyzer - Google Patents

Japanese language analyzer

Info

Publication number
JPH03263260A
Authority
JP
Japan
Prior art keywords
word
unregistered
character string
proper noun
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2063234A
Other languages
Japanese (ja)
Inventor
Yuji Ito
雄二 伊藤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1990-03-14
Filing date
1990-03-14
Publication date
1991-11-22
Application filed by Matsushita Electric Industrial Co Ltd
Priority to JP2063234A
Publication of JPH03263260A

Abstract

PURPOSE: To determine the range of an unregistered word and infer the attribute of a proper noun by performing unregistered-word processing centered on proper-noun detection, using the information carried by words that co-occur with the proper noun. CONSTITUTION: Character string data constituting a Japanese text is entered through an input means 1 and stored in an input character string storage unit 2. Words and their attribute data are stored in a word dictionary unit 3, and words that connect to proper nouns, together with the attributes of those proper nouns, are stored in a co-occurrence word table 4. A morphological analysis control unit 14 analyzes the character string data stored in the storage unit 2 by referring to the dictionary unit 3. When no word corresponding to the character string under analysis exists in the dictionary, or when a connection table 6 does not permit connection to the preceding word, an unregistered word is judged to be present, and unregistered-word processing starts from that point in the character string. An unregistered word detection control unit 11 then has an unregistered word tentative range detection unit 8 decide a tentative range for the unregistered word, and the attribute of the detected proper noun is inferred.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Field of Application] The present invention relates to a Japanese language analysis device that analyzes the part of speech and structure of each word constituting a Japanese sentence.

[Background Art] In recent years, demand has been growing for natural language processing technology, of which machine translation is a representative example. Morphological analysis of text written in mixed kanji and kana is the first step in analyzing Japanese, and its accuracy affects the accuracy of the analysis as a whole. A major problem in morphological analysis is the handling of unregistered words, that is, words not registered in the dictionary.

Methods commonly used to extract unregistered words from a sentence include the following two. In the first, the sentence is analyzed from the beginning, and the run of characters of the same character type starting at the point where dictionary lookup fails is treated as an unregistered word. In the second, the character at which lookup failed is skipped and lookup is retried from the next character; this is repeated until a word can be extracted, and everything up to the character immediately before that word is treated as an unregistered word.
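
The two conventional methods can be sketched as follows. This is a minimal illustration in Python, not code from the patent: the function names, the code-point ranges used to classify character types, and the sample dictionary are all assumptions made for the example.

```python
from typing import Optional, Set


def char_type(ch: str) -> str:
    """Classify a character by script using Unicode code point ranges (assumed classification)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def longest_match(text: str, pos: int, dictionary: Set[str]) -> Optional[str]:
    """Return the longest dictionary word starting at pos, or None if lookup fails."""
    for end in range(len(text), pos, -1):
        if text[pos:end] in dictionary:
            return text[pos:end]
    return None


def same_type_run(text: str, pos: int) -> str:
    """Method 1: treat the run of same-type characters starting at pos as the unregistered word."""
    kind = char_type(text[pos])
    end = pos
    while end < len(text) and char_type(text[end]) == kind:
        end += 1
    return text[pos:end]


def skip_until_match(text: str, pos: int, dictionary: Set[str]) -> str:
    """Method 2: skip one character at a time until a dictionary word can be extracted."""
    end = pos + 1
    while end < len(text) and longest_match(text, end, dictionary) is None:
        end += 1
    return text[pos:end]


# Hypothetical dictionary in which the name 中曽根 is not registered.
sample_dictionary = {"首相", "が"}
print(same_type_run("中曽根首相が", 0))
print(skip_until_match("中曽根首相が", 0, sample_dictionary))
```

With this sample dictionary, the first method returns the whole kanji run 中曽根首相, while the second stops at the registered word and returns 中曽根.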

[Problem to Be Solved by the Invention] With the first method, which treats a run of the same character type as an unregistered word, a string such as 「中曽根首相」 ("Prime Minister Nakasone") becomes an unregistered word in its entirety when dictionary lookup fails at 「中」, even though 「首相」 ("prime minister") is a registered word. With the second method, which retries the lookup while shifting one character at a time, the same string yields 「首相」 on the third attempt, and the preceding 「中曽根」 ("Nakasone") is recognized as an unregistered word; depending on how the characters are arranged, however, this method may segment the string incorrectly.

[Means for Solving the Problem] An examination of the words detected as unregistered suggests that most of them are nouns, and that proper nouns (names of persons, places, and organizations) account for a large share of these. Moreover, such proper nouns rarely appear alone in a sentence; they usually appear together with co-occurring words (for example, 首相 "prime minister", 社長 "company president", 市 "city", 銀行 "bank").

With this in mind, and in order to solve the problem, the present invention comprises a co-occurrence word table that stores words that connect to proper nouns together with the attributes of the proper nouns to which those words connect, and an analysis means that analyzes the character string data stored in a storage means by referring to a word dictionary. For a character string that the analysis means cannot analyze and that is connected to a word stored in the co-occurrence word table, the attribute is obtained by referring to the co-occurrence word table.

[Operation] Performing unregistered-word processing centered on proper-noun detection by the above means reduces the likelihood that unregistered words in a sentence are analyzed incorrectly, and allows the attribute (person name, place name, or organization name) of a detected proper noun to be inferred from the information carried by the words that co-occur with it.

[Embodiment] A Japanese language analysis device according to one embodiment of the present invention is described below with reference to the drawings.

Fig. 1 is a functional block diagram of the Japanese language analysis device according to one embodiment of the present invention. In the figure, 1 is an input means for Japanese character strings. 2 is an input character string storage unit that stores the Japanese character strings entered through the input means 1. 3 is a word dictionary unit that holds morpheme information consisting of each word's reading, written form, part of speech, and so on. 4 is a co-occurrence word table which, as described later, holds the written form and reading of words that are likely to co-occur with proper nouns such as person names, place names, and organization names, together with the attribute of the co-occurring proper noun (person name, place name, or organization name, as above). 5 is a prefix table that registers words often used as prefixes of co-occurring words.

6 is a connection table that holds information on whether or not a connection between words is permitted. 7 is a dictionary/table search unit that compares the character string under analysis, sequentially from its beginning, against the prefix table 5, the co-occurrence word table 4, and the word dictionary 3, and retrieves the word with the longest match. 8 is an unregistered word tentative range detection unit that, when dictionary lookup fails, detects a tentative range for the unregistered character string by a method described later. 9 is a proper noun detection unit that detects a proper noun within that unregistered character string by a method described later. 10 is a proper noun attribute determination unit that infers the attribute of the proper noun from the information carried by the co-occurring word detected by the proper noun detection unit 9. 11 is an unregistered word detection control unit that determines the range of the unregistered word by means of the unregistered word tentative range detection unit 8 and the proper noun attribute determination unit 10. 12 is an analysis result storage unit that stores the results of morphological analysis, 13 is an output display unit, and 14 is a morphological analysis control unit that controls Japanese morphological analysis.

The operation of the device configured as described above is as follows. First, the input Japanese character string is analyzed from its beginning. Starting at the head of the character string under analysis, the dictionary search part of the dictionary/table search unit 7 searches the word dictionary 3 and retrieves the word with the longest match. Once a word has been extracted, the search resumes from the character following that word to extract the next word. When the next word has been extracted, the connection table 6 is used to check whether the preceding word and the following word can be connected; if they can, the analysis of that portion is taken to be correct and extraction proceeds to the next word. If no word corresponding to the character string under analysis is found in the dictionary, or if the connection table indicates that connection to the preceding word is not possible, an unregistered word is judged to be present at that point, and unregistered-word processing begins from there.
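
The main analysis loop described above might look like the following sketch. It assumes, hypothetically, that the word dictionary maps a written form to a part of speech and that the connection table is a set of permitted part-of-speech pairs; the text only says that the table records whether words can connect. The names `longest_match` and `analyze` and the sample entries are illustrative, not taken from the patent.

```python
from typing import Dict, List, Optional, Set, Tuple


def longest_match(text: str, pos: int, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the longest dictionary word starting at pos, or None if none matches."""
    for end in range(len(text), pos, -1):
        if text[pos:end] in dictionary:
            return text[pos:end]
    return None


def analyze(
    text: str,
    dictionary: Dict[str, str],
    connectable: Set[Tuple[str, str]],
) -> Tuple[List[str], Optional[int]]:
    """Segment text from its beginning.

    Returns the words extracted so far and the position at which an
    unregistered word is judged to be present (None if analysis succeeds).
    """
    words: List[str] = []
    pos = 0
    while pos < len(text):
        word = longest_match(text, pos, dictionary)
        if word is None:
            # No dictionary word starts here: unregistered-word processing begins.
            return words, pos
        if words:
            pair = (dictionary[words[-1]], dictionary[word])
            if pair not in connectable:
                # The connection table rejects this pair of adjacent words.
                return words, pos
        words.append(word)
        pos += len(word)
    return words, None


# Hypothetical dictionary and connection table for illustration only.
sample_dictionary = {"私": "noun", "は": "particle", "行く": "verb"}
sample_connections = {("noun", "particle"), ("particle", "verb")}
print(analyze("私は行く", sample_dictionary, sample_connections))
print(analyze("私は中曽根", sample_dictionary, sample_connections))
```

The second call stops at position 2, where 中 cannot be matched, which is the point from which unregistered-word processing would start.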

The processing is described below with reference to the flowchart of Fig. 2. The unregistered word detection control unit 11 has the unregistered word tentative range detection unit 8 determine a tentative unregistered word range according to the criteria of Fig. 3. For the tentative range set in this way, the control unit 11 then has the proper noun detection unit 9, in step s1, use the dictionary/table search unit 7 to search from the end of the tentative range for a word registered in the co-occurrence word table of Fig. 4. If a matching word is found in step s1, processing moves to step s1a, where the character immediately before that word is examined to see whether it is registered in the prefix table of Fig. 5. If it is, step s1b treats the part before that character as the unregistered word.

If no prefix matches, step s1c treats the part from the beginning of the range up to just before the co-occurrence table word as the unregistered word.

In step s1d, the proper noun attribute determination unit 10 determines the attribute of the detected proper noun from the "attribute" information in the co-occurrence word table and outputs it to the analysis result storage unit 12.

Step s2 determines whether a co-occurring word was found. If not, step s2a takes the word immediately before the tentative unregistered word range and checks it against the co-occurrence word table. If it matches, step s2a treats the entire tentative range as the unregistered word and processing moves to step s1d. This handles the case in which the co-occurring word comes before the proper noun. If no co-occurring word is found before the proper noun either, step s2b treats the entire tentative range as the unregistered word.

The actual processing is now explained in detail using concrete examples.

Consider the following example sentence: 「・・・は中曽根元首相が12月25日に北京で・・・」 ("... former Prime Minister Nakasone ... in Beijing on December 25 ..."). If dictionary lookup fails at 「中」, the unregistered word tentative range detection unit 8 determines, by the criteria of Fig. 3, that 「中曽根元首相」 is the tentative range. The dictionary/table search unit searches the co-occurrence word table of Fig. 4 from the end of this string and finds 「首相」 (step s1). The character before 「首相」, 「元」 ("former"), is then looked up in the prefix table of Fig. 5, and a match is found. The unregistered word detection control unit 11 therefore judges the unregistered word in this case to be 「中曽根」 (step s1b). Furthermore, since the attribute of 「首相」 in the table is "person name", this result is output to the analysis result storage unit 12 as shown in Fig. 6 (step s1d).

Consider another example: 「・・・は元首相中曽根が・・・」 ("... former Prime Minister Nakasone ...", with the title preceding the name). If dictionary lookup again fails at 「中」, the tentative range is 「中曽根」. A search for a co-occurring word from the end of this range finds nothing (step s2). The word before the range, 「首相」, is therefore looked up in the co-occurrence word table, and a match is found. 「中曽根」 is thus treated as the unregistered word, the attribute "person name" associated with 「首相」 is assigned to it, and the result is output to the analysis result storage unit 12.
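
The flow of Fig. 2 and the two examples above can be approximated by the following sketch. It is an illustration under stated assumptions, not the patented implementation: Figs. 3 to 5 are not reproduced in the text, so the table contents below are hypothetical apart from 首相 ("person name") and the prefix 元, the tentative range is taken as an input rather than computed, and the function name `detect_proper_noun` is invented for the example.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical co-occurrence word table: written form -> attribute of the proper noun.
# Only the 首相 entry is confirmed by the text; the others are assumed.
COOCCURRENCE: Dict[str, str] = {
    "首相": "person",
    "社長": "person",
    "市": "place",
    "銀行": "organization",
}

# Hypothetical prefix table; 元 ("former") is the one entry mentioned in the text.
PREFIXES: Set[str] = {"元"}


def detect_proper_noun(
    tentative: str, previous_word: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (unregistered word, inferred attribute) for a tentative range.

    previous_word is the word immediately before the tentative range, if any.
    The attribute is None when no co-occurring word is found.
    """
    # Step s1: look for a co-occurrence word at the end of the range,
    # trying the longest suffix first.
    for start in range(1, len(tentative)):
        suffix = tentative[start:]
        if suffix in COOCCURRENCE:
            head = tentative[:start]
            # Step s1a: is the character just before the co-occurrence word a prefix?
            if len(head) > 1 and head[-1] in PREFIXES:
                head = head[:-1]  # Step s1b: drop the prefix character as well.
            # Otherwise step s1c: everything before the co-occurrence word.
            return head, COOCCURRENCE[suffix]  # Step s1d: attribute from the table.
    # Step s2 / s2a: nothing at the end, so check the word before the range.
    if previous_word in COOCCURRENCE:
        return tentative, COOCCURRENCE[previous_word]
    # Step s2b: the whole range is an unregistered word with no inferred attribute.
    return tentative, None


print(detect_proper_noun("中曽根元首相"))
print(detect_proper_noun("中曽根", previous_word="首相"))
```

Both calls yield 中曽根 with the attribute "person", matching the two worked examples; a range with no co-occurring word on either side is returned unchanged with no attribute.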

[Effects of the Invention] As described above, the present invention provides a co-occurrence word table that stores words that connect to proper nouns together with the attributes of the proper nouns to which those words connect. When a word not registered in the dictionary is encountered, the co-occurrence word table is searched for the words connected to it and attribute information for the unregistered word is obtained. This makes it possible to determine the range of an unregistered word accurately and to infer the attribute of the detected proper noun.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of a Japanese language analysis device according to one embodiment of the present invention; Fig. 2 is a flowchart showing the control procedure in that embodiment; Fig. 3 shows the criteria for determining the tentative range of an unregistered character string; Fig. 4 shows the co-occurrence word table; Fig. 5 shows the prefix table; and Fig. 6 shows an example of an output result in that embodiment.

1: input means; 2: input character string storage unit; 3: word dictionary unit; 4: co-occurrence word table; 5: prefix table; 6: connection table; 7: dictionary/table search unit; 8: unregistered word tentative range detection unit; 9: proper noun detection unit; 10: proper noun attribute determination unit; 11: unregistered word detection control unit; 12: analysis result storage unit; 13: output display unit; 14: morphological analysis control unit.

Claims (1)

[Scope of Claims] A Japanese language analysis device comprising: an input means for inputting character string data constituting a Japanese sentence; a storage means for storing the character string data input from the input means; a word dictionary storing words and attribute data of the words; a co-occurrence word table storing words that connect to proper nouns and the attributes of the proper nouns to which those words connect; a morphological analysis means for analyzing the character string data stored in the storage means by referring to the word dictionary; and a means for obtaining, by referring to the co-occurrence word table, the attribute of a character string that cannot be analyzed by the morphological analysis means and that is connected to a word stored in the co-occurrence word table.
JP2063234A 1990-03-14 1990-03-14 Japanese language analyzer Pending JPH03263260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2063234A JPH03263260A (en) 1990-03-14 1990-03-14 Japanese language analyzer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2063234A JPH03263260A (en) 1990-03-14 1990-03-14 Japanese language analyzer

Publications (1)

Publication Number Publication Date
JPH03263260A true JPH03263260A (en) 1991-11-22

Family

ID=13223325

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2063234A Pending JPH03263260A (en) 1990-03-14 1990-03-14 Japanese language analyzer

Country Status (1)

Country Link
JP (1) JPH03263260A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001216300A (en) * 2000-01-31 2001-08-10 Just Syst Corp Authorization device and authorization method for individual name, and recording medium
JP2002259368A (en) * 2001-03-01 2002-09-13 Nippon Telegr & Teleph Corp <Ntt> Method and device for working document cipher, document cipher working processing program and recording medium therefor

Similar Documents

Publication Publication Date Title
US5890103A (en) Method and apparatus for improved tokenization of natural language text
US6115683A (en) Automatic essay scoring system using content-based techniques
WO1997004405A9 (en) Method and apparatus for automated search and retrieval processing
JP2000259635A (en) Translation device, translation method and recording medium storing translation program
JPH03263260A (en) Japanese language analyzer
JPS6368972A (en) Unregistered word processing system
JPH05233686A (en) Japanese language processor
JPS62262178A (en) Language analyzing device
JP2838849B2 (en) Language processor
JPH04213164A (en) Dictionary consulting system
JPH07200592A (en) Text processor
JPH04177463A (en) Language processing method and device
JPS59100943A (en) Kana (japanese syllabary)-kanji (chinese character) converter
JPH04268669A (en) Japanese language analyzer
JPH02105968A (en) Automatic test and correction system for japanese sentence error
JPH01232471A (en) Morpheme analyzer
JPH09319746A (en) Document analysis method and device
JPH03127173A (en) Japanese morpheme analyzing method
JPH052602A (en) Working-over support system
JPH09128393A (en) Machine translation device
JPS63136264A (en) Mechanical translating device
JPH0452963A (en) Japanese language morpheme analyzer
JPS6395570A (en) Language analysis system
JPH01316863A (en) Automatic qualifying and correcting device for error in japanese language text
JPH0239357A (en) Automatic checking/correcting device for japanese sentence